In this analysis we predict the count of bike rentals in Washington, DC using regression analysis, decision trees, Support Vector Machines (SVMs) and Neural Networks (NNs).

0. Import Packages

Before the project starts, all required packages are imported.

library(ggplot2)
library(e1071)
library(mgcv)
library(corrplot)
library(tree)
library(tidyverse)
library(boot)
library(MASS)
library(ipred)
library(randomForest)
library(gbm)
library(modelr)
library(neuralnet)
library(h2o)
library(bit64)
library(caret)

1. Getting Data

Data source

We downloaded the dataset “bikesharing.csv” from https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Bike%20Sharing.

Loading the data

In a first step the dataset is imported to R and stored in the data.frame d.bike:

d.bike <- read.csv("bikesharing.csv", header=TRUE)

Describing the dataset

The dataset contains 17379 hourly observations of 17 variables describing calendar and weather conditions. The individual attributes are explained below:

  • instant: record index
  • dteday: Date
  • season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • yr: Year; 0: 2011, 1: 2012
  • mnth: Month; 1 to 12
  • hr: Hour; 0 to 23
  • holiday: 0 = no holiday, 1 = holiday
  • weekday: Day of the week; 0 to 6 (0 = Sunday; e.g. the first rows below, from Saturday 2011-01-01, have weekday 6)
  • workingday: 0 = weekend or holiday, 1 = working day
  • weathersit:
    • 1: Clear, Few clouds, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp: Normalized temperature in Celsius; the values are derived via (t-t_min)/(t_max-t_min), t_min = -8, t_max = +39
  • atemp: Normalized feeling temperature in Celsius; the values are derived via (t-t_min)/(t_max-t_min), t_min = -16, t_max = +50
  • hum: Normalized relative humidity; the values are divided by 100 (max)
  • windspeed: Normalized wind speed; the values are divided by 67 (max)
  • casual: number of casual (non-registered) users (e.g. tourists)
  • registered: number of registered users (e.g. commuters who use bikes daily)
  • cnt: number of total rentals

Investigating the correlation between all predictors

In order to make a valid feature selection in the next step, we now investigate the correlations between the predictors. The so-called corrplot offers a simple and clear display of the correlation matrix: correlations are represented by coloured dots, and the exact values can be read off the scale.

predictors = c("dteday", "season", "yr","mnth","hr","holiday","weekday",
             "workingday","weathersit","temp","atemp","hum","windspeed", "casual", "registered")
corrplot(cor(data.matrix(d.bike[predictors])))

As the corrplot shows, dteday correlates strongly with the predictor yr. Furthermore, dteday also correlates with season and mnth, and there is a relatively strong correlation between mnth and season. The strongest correlation, of almost 1, is between temp and atemp.

Feature Selection

We decide to delete the features instant and dteday. instant is an artificially created record index and therefore has no relevant effect on cnt. dteday is the date of the record; it correlates with the three attributes season, yr and mnth. The features season and mnth as well as temp and atemp will be kept for now.

d.bike <- subset(d.bike, select=-c(instant, dteday))

Checking the data

head(d.bike)
##   season yr mnth hr holiday weekday workingday weathersit temp  atemp  hum
## 1      1  0    1  0       0       6          0          1 0.24 0.2879 0.81
## 2      1  0    1  1       0       6          0          1 0.22 0.2727 0.80
## 3      1  0    1  2       0       6          0          1 0.22 0.2727 0.80
## 4      1  0    1  3       0       6          0          1 0.24 0.2879 0.75
## 5      1  0    1  4       0       6          0          1 0.24 0.2879 0.75
## 6      1  0    1  5       0       6          0          2 0.24 0.2576 0.75
##   windspeed casual registered cnt
## 1    0.0000      3         13  16
## 2    0.0000      8         32  40
## 3    0.0000      5         27  32
## 4    0.0000      3         10  13
## 5    0.0000      0          1   1
## 6    0.0896      0          1   1
tail(d.bike)
##       season yr mnth hr holiday weekday workingday weathersit temp  atemp  hum
## 17374      1  1   12 18       0       1          1          2 0.26 0.2727 0.48
## 17375      1  1   12 19       0       1          1          2 0.26 0.2576 0.60
## 17376      1  1   12 20       0       1          1          2 0.26 0.2576 0.60
## 17377      1  1   12 21       0       1          1          1 0.26 0.2576 0.60
## 17378      1  1   12 22       0       1          1          1 0.26 0.2727 0.56
## 17379      1  1   12 23       0       1          1          1 0.26 0.2727 0.65
##       windspeed casual registered cnt
## 17374    0.1343     10        112 122
## 17375    0.1642     11        108 119
## 17376    0.1642      8         81  89
## 17377    0.1642      7         83  90
## 17378    0.1343     13         48  61
## 17379    0.1343     12         37  49

It looks like the dataset was imported completely. To check whether there are any missing values (not available, NA) in this dataset, we count the NAs:

sum(is.na(d.bike))
## [1] 0
mean(is.na(d.bike))
## [1] 0

The dataset contains no missing values.

Creating training and test data set

At this point the dataset d.bike is divided into a training set (75% of the data) and a test set (25%). A seed is set so that the random split is reproducible.

set.seed(1)  # seed chosen arbitrarily, for reproducibility
d.bike.train.id <- sample(seq_len(nrow(d.bike)), size = floor(0.75 * nrow(d.bike)))
d.bike.train <- d.bike[d.bike.train.id, ]
d.bike.test <- d.bike[-d.bike.train.id, ]
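If a stratified split is preferred, caret (loaded above) provides createDataPartition, which balances the split over quantile groups of the outcome. A minimal sketch on a toy outcome vector (the toy data and seed are illustrative, not from the report):

```r
library(caret)

set.seed(2020)                                      # illustrative seed
toy <- data.frame(cnt = rpois(100, lambda = 190))   # stand-in for d.bike
# createDataPartition stratifies a numeric outcome over its quantile groups
train.id <- createDataPartition(toy$cnt, p = 0.75, list = FALSE)
train <- toy[train.id, , drop = FALSE]
test  <- toy[-train.id, , drop = FALSE]
c(nrow(train), nrow(test))
```

On the real data, the call would be createDataPartition(d.bike$cnt, p = 0.75, list = FALSE).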

2. Explorative Data Analysis

In this section each attribute is considered to make certain assumptions. Different statistical diagrams are used for this purpose. The effects of the different attributes on cnt are tested using ANOVA for categorical attributes and a linear regression model for numerical values. The aim of this explorative data analysis is to get a starting point for the model building.

To keep our documentation clear, not all graphical analyses are shown. For example, interactions between predictors are not shown in this document but only described. At these points, references are made to the R-file.

Count

In the very first step we would like to look at the distribution of the response variable count. For this purpose we create a histogram.

ggplot(data = d.bike, aes(x = cnt)) +
  geom_histogram(bins = 30, colour = "black") + xlab("Count") + ylab("Frequency")

mean(d.bike$cnt)
## [1] 189.4631
sd(d.bike$cnt)
## [1] 181.3876

As we can see, the distribution is right-skewed and resembles a gamma distribution. In most hours (the observations are hourly counts) between 0 and 30 bikes (the leftmost bar) are rented.
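The right skew can be quantified with the skewness function from e1071 (loaded above). A sketch on a synthetic gamma sample whose mean and spread roughly match those of cnt (the gamma parameters are illustrative, not fitted):

```r
library(e1071)

set.seed(42)
# synthetic gamma sample, roughly matching mean ~189 and sd ~181 of cnt
x <- rgamma(10000, shape = 1.1, scale = 172)
skewness(x)   # clearly positive, confirming the right skew
```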

Bootstrapping:

To quantify the uncertainty of the sample mean, we draw 1000 bootstrap resamples and derive a 95% percentile confidence interval:

mean(d.bike$cnt)
## [1] 189.4631
id <- sample(1:length(d.bike$cnt), replace = TRUE)
mean(d.bike$cnt[id])
## [1] 188.984
B <- 1000
t.mean <- c()
for(i in 1:B){
  t.id <- sample(1:length(d.bike$cnt), replace = TRUE)
  t.d.bike <- d.bike$cnt[t.id]
  t.mean[i] <- mean(t.d.bike)
}
length(t.mean)
## [1] 1000
hist(t.mean, breaks = 50)
abline(v = mean(d.bike$cnt), col = "red")

sorted.means <- sort(t.mean)
quantile(sorted.means, probs = c(0.025, 0.975))
##     2.5%    97.5% 
## 186.8310 191.9523
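The manual loop above can also be expressed with the boot package (loaded above). This sketch uses a synthetic stand-in for d.bike$cnt, so the interval itself is illustrative:

```r
library(boot)

set.seed(7)
x <- rgamma(5000, shape = 1.1, scale = 172)        # stand-in for d.bike$cnt
# statistic: mean of the resampled observations selected by index vector i
b <- boot(x, statistic = function(d, i) mean(d[i]), R = 1000)
boot.ci(b, type = "perc")                          # 95% percentile interval
```

On the real data, boot(d.bike$cnt, ...) reproduces the percentile interval computed above up to resampling noise.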

Season

If we want to visualize the effect of the season as a categorical variable on the number of bike rentals it is a good idea to use a box plot.

ggplot(data = d.bike, aes(group=season, y = cnt, x = as.factor(season))) +
  geom_boxplot() + xlab("Season") + ylab("Count")

As the box plot shows, season 1 (spring) is particularly different from seasons 2, 3 and 4 (summer, autumn and winter). The median is lowest in spring, even lower than in winter. In summer, autumn and winter, the medians differ only marginally. Since the medians differ between some seasons, one can assume at least a slight effect. However, we want to examine this more closely.

With an Analysis of Variance (ANOVA) we check whether at least one level of season has a significant effect on the number of bike rentals. The first step is to create a model that does not consider any effect of season. With an F-test we then compare the two models.

lm.season.1 <- lm(cnt ~ as.factor(season), data = d.bike)
lm.season.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.season.0, lm.season.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(season)
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1  17378 571761591                                  
## 2  17375 534032233  3  37729358 409.18 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see from the p-value, there is strong evidence that at least one level of season has an effect on the number of bike rentals. The lower residual sum of squares (RSS) of lm.season.1 also indicates that the more complex model fits the data better than the less complex model lm.season.0.

Year

Since we also consider year to be a categorical variable, a box plot is again a good choice. For all categorical variables we will use a boxplot for the graphical representation in the following.

ggplot(data = d.bike, aes(group = yr, y = cnt, x = as.factor(yr))) +
  geom_boxplot() + xlab("Year") + ylab("Count")

At first glance we can see that the number of bike rentals has increased in year 1 (2012). The median is above the median of year 0 (2011).

lm.year.1 <- lm(cnt ~ as.factor(yr), data = d.bike)
lm.year.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.year.0, lm.year.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(yr)
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1  17378 571761591                                  
## 2  17377 535884870  1  35876722 1163.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value shows strong evidence that the year (2011 vs. 2012) has a significant effect on cnt. Furthermore, the RSS shows that the more complex model lm.year.1 has considerably smaller residuals than the simpler one.

Month

ggplot(data = d.bike, aes(group=mnth, y = cnt, x = as.factor(mnth))) +
  geom_boxplot() + scale_x_discrete(limits=seq(1,12)) + xlab("Month") + ylab("Count")

Here we see that bike rentals increased during the months May to September, when temperatures are more pleasant. Visually, the month seems to have an effect on cnt.

lm.month.1 <- lm(cnt ~ as.factor(mnth), data = d.bike)
lm.month.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.month.0, lm.month.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(mnth)
##   Res.Df       RSS Df Sum of Sq     F    Pr(>F)    
## 1  17378 571761591                                 
## 2  17367 528851615 11  42909976 128.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value shows, there is strong evidence that at least one month has a significant effect on cnt. The RSS is also unsurprising: the more complex model lm.month.1 represents the data better than lm.month.0.

Hour

ggplot(data = d.bike, aes(group=hr, y = cnt, x = as.factor(hr))) +
  geom_boxplot() + scale_x_discrete(limits=seq(0,23)) + xlab("Hour") + ylab("Count")

The time of day matters as well. Bikes were rented most often in the morning between 7 and 8 am and in the evening between 5 and 6 pm. The medians follow the same pattern. The individual hours of the day seem to differ greatly from one another.

lm.hr.1 <- lm(cnt ~ as.factor(hr), data = d.bike)
lm.hr.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.hr.0, lm.hr.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(hr)
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1  17378 571761591                                  
## 2  17355 285026910 23 286734681 759.09 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value reflects this: at least one hour of the day has a significant effect on cnt. The RSS differs very strongly between the complex and the simple model. The hour of the day will probably play a role in the final regression model.

Holiday

ggplot(data = d.bike, aes(group=holiday, y = cnt, x = as.factor(holiday))) +
  geom_boxplot() + xlab("Holiday") + ylab("Count")

According to the box plot, fewer bikes are rented on holidays. However, the difference is small: the holiday median is close to the non-holiday median.

lm.holiday.1 <- lm(cnt ~ as.factor(holiday), data = d.bike)
lm.holiday.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.holiday.0, lm.holiday.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(holiday)
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1  17378 571761591                                  
## 2  17377 571214702  1    546889 16.637 4.546e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value shows strong evidence that holiday status has a significant effect on cnt. However, the RSS differs only minimally between the simple and the complex model, so holiday could be a feature that will not be included in the model.

Weekday

ggplot(data = d.bike, aes(group=weekday, y= cnt, x = as.factor(weekday))) +
  geom_boxplot() + xlab("Weekday") + ylab("Count")

Across all seven days of the week, bike rentals are almost constant; on Sunday and Monday slightly fewer bikes are rented. The median is lowest on Monday and highest on Saturday. On all other days the medians are similar.

lm.weekday.1 <- lm(cnt ~ as.factor(weekday), data = d.bike)
lm.weekday.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.weekday.0, lm.weekday.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(weekday)
##   Res.Df       RSS Df Sum of Sq      F   Pr(>F)   
## 1  17378 571761591                                
## 2  17372 571073662  6    687929 3.4878 0.001899 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value shows, there is only moderate evidence that at least one weekday has a significant effect on cnt. The RSS also does not differ much between the simple and the complex model. This suggests that weekday will probably not be included in the final model.

Working day

ggplot(data = d.bike, aes(group=workingday, y = cnt, x = as.factor(workingday))) +
  geom_boxplot() + xlab("Working day") + ylab("Count")

On working and non-working days, the counts of bike rentals are almost equal, and the two medians are nearly identical.

lm.workingday.1 <- lm(cnt ~ as.factor(workingday), data = d.bike)
lm.workingday.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.workingday.0, lm.workingday.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(workingday)
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1  17378 571761591                                  
## 2  17377 571237204  1    524387 15.952 6.524e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value shows, there is strong evidence that working-day status has a significant effect on cnt. Again, however, the RSS does not differ very much. workingday possibly brings no benefit to the model and will probably be omitted.

Weather Situation

ggplot(data = d.bike, aes(group=weathersit, y = cnt, x = as.factor(weathersit))) +
  geom_boxplot()  + xlab("Weather Situation") + ylab("Count")

The box plot shows bike rentals steadily decreasing as the weather situation worsens from 1 (good weather) to 4 (bad weather). From value 3 onwards, i.e. in genuinely bad weather, rentals drop off more sharply. The medians follow the same pattern from box plot 1 to 4.

lm.weathersit.1 <- lm(cnt ~ as.factor(weathersit), data = d.bike)
lm.weathersit.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.weathersit.0, lm.weathersit.1)
## Analysis of Variance Table
## 
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(weathersit)
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1  17378 571761591                                  
## 2  17375 559476561  3  12285030 127.17 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is strong evidence that at least one weather situation has a significant effect on cnt. The RSS differs greatly between the two models. This will probably be a feature that makes it into the final model.

Temperature

ggplot(data = d.bike, mapping = aes(y = cnt, x = temp)) +
  geom_point()  + xlab("Normalized Temp") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The smoother confirms the positive association: higher temperatures lead to a higher number of rented bikes. However, the smoothing spline also shows that the relationship between temp and cnt is more complex than linear. It is therefore possible that a Generalised Additive Model (GAM) could be used if temp is included in the final model.

library(mgcv)
gam.temp.1 <- gam(cnt ~ s(temp), data = d.bike)
summary(gam.temp.1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cnt ~ s(temp)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  189.463      1.252   151.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(temp) 8.747  8.975 403.3  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.172   Deviance explained = 17.3%
## GCV =  27250  Scale est. = 27235     n = 17379

First, the relatively high estimated degrees of freedom (edf) of 8.747 is striking. It indicates the high complexity of the relationship. The p-value shows strong evidence that the effect is not 0 and thus there is an association between temp and cnt. The adjusted R-squared shows that about 17 percent of the variability is explained by the model. Thus temp could play a role in the final model. temp shows interaction effects with all of the categorical variables except yr (see line 142, group13_bikesharing.R).
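Such interactions between temp and a categorical variable can also be modelled directly in mgcv with a by-variable smooth, which fits a separate temperature curve per level. A sketch on synthetic data (the data-generating process is invented purely for illustration):

```r
library(mgcv)

set.seed(1)
n <- 800
season <- factor(sample(1:4, n, replace = TRUE))
temp   <- runif(n)
# synthetic response: the temperature effect differs by season
cnt <- 50 + 100 * temp * as.numeric(season) + rnorm(n, sd = 10)
d   <- data.frame(cnt, season, temp)
# s(temp, by = season) estimates one smooth of temp per season level
fit <- gam(cnt ~ season + s(temp, by = season), data = d)
summary(fit)$r.sq
```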

Feeling Temperature

ggplot(data = d.bike, mapping = aes(y = cnt, x = atemp)) +
  geom_point() + xlab("Normalized Feeling Temp") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Not surprisingly, the relationship between atemp and cnt is similarly complex. So again a GAM is useful to investigate the effect of atemp on cnt.

gam.atemp.1 <- gam(cnt ~ s(atemp), data = d.bike)
summary(gam.atemp.1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cnt ~ s(atemp)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  189.463      1.253   151.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(atemp) 7.556  8.338 427.3  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.17   Deviance explained = 17.1%
## GCV =  27315  Scale est. = 27301     n = 17379

The edf again indicates a complex relationship between atemp and cnt. The p-value shows strong evidence that the effect is not 0. The adjusted R-squared shows that the model explains about 17% of the variability. Therefore, atemp could also play a role in the final model. atemp shows interaction effects with all of the categorical variables except yr (see line 160, group13_bikesharing.R).

Humidity

ggplot(data = d.bike, mapping = aes(y = cnt, x = hum)) +
  geom_point() + xlab("Humidity") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The smoothing spline shows a possible quadratic effect of hum on cnt. Once again, a GAM is the obvious choice.

gam.hum.1 <- gam(cnt ~ s(hum), data = d.bike)
summary(gam.hum.1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cnt ~ s(hum)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  189.463      1.298     146   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(hum) 7.872  8.548 252.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.111   Deviance explained = 11.1%
## GCV =  29281  Scale est. = 29266     n = 17379

As the hypothesis test shows, there is strong evidence that hum has an effect on cnt. Furthermore, the adjusted R-squared shows that the model explains about 11% of the variability. Thus, hum could also be relevant for the final model. hum shows interaction effects with all of the categorical variables (see line 178, group13_bikesharing.R).
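Whether the simpler quadratic form suspected above would suffice can be checked by comparing a degree-2 polynomial fit against the smooth, e.g. via AIC. A sketch on synthetic data with a built-in quadratic effect (the data-generating process is invented for illustration):

```r
library(mgcv)

set.seed(3)
n <- 1000
hum <- runif(n)
cnt <- 200 - 300 * (hum - 0.4)^2 + rnorm(n, sd = 20)   # quadratic by construction
d <- data.frame(cnt, hum)
fit.quad <- lm(cnt ~ poly(hum, 2), data = d)           # parametric quadratic
fit.gam  <- gam(cnt ~ s(hum), data = d)                # flexible smooth
AIC(fit.quad, fit.gam)   # similar AICs would favour the simpler quadratic term
```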

Windspeed

ggplot(data = d.bike, mapping = aes(y = cnt, x = windspeed)) +
  geom_point() + xlab("Windspeed") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The graph suggests that any effect of windspeed on cnt is roughly linear. Moreover, the almost flat smoothing spline indicates that windspeed has very little influence.

lm.windspeed.1 <- lm(cnt ~ windspeed, data = d.bike)
summary(lm.windspeed.1)
## 
## Call:
## lm(formula = cnt ~ windspeed, data = d.bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -265.47 -146.00  -48.75   90.25  806.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  163.185      2.532   64.46   <2e-16 ***
## windspeed    138.233     11.198   12.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 180.6 on 17377 degrees of freedom
## Multiple R-squared:  0.008693,   Adjusted R-squared:  0.008635 
## F-statistic: 152.4 on 1 and 17377 DF,  p-value: < 2.2e-16

The model estimates a positive effect of windspeed on cnt. This is surprising, since the observed counts seem to decrease with increasing wind speed. As the R-squared shows, this regression model explains less than 1% of the variability. This indicates that the attribute can be omitted in the final model.

Casual Users and registered users

ggplot(data = d.bike, mapping = aes(y = cnt, x = casual)) +
  geom_point() + xlab("Casual Users") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

gam.casual.1 <- gam(cnt ~ s(casual), data = d.bike)
summary(gam.casual.1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cnt ~ s(casual)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 189.4631     0.8841   214.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df    F p-value    
## s(casual) 8.928  8.998 2748  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.587   Deviance explained = 58.7%
## GCV =  13591  Scale est. = 13583     n = 17379
ggplot(data = d.bike, mapping = aes(y = cnt, x = registered)) +
  geom_point() + xlab("Registered Users") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

lm.registered.1 <- lm(cnt ~ registered, data = d.bike)
summary(lm.registered.1)
## 
## Call:
## lm(formula = cnt ~ registered, data = d.bike)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -118.031  -16.264   -9.957    4.730  304.223 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.296458   0.459718    22.4   <2e-16 ***
## registered   1.165032   0.002131   546.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.51 on 17377 degrees of freedom
## Multiple R-squared:  0.9451, Adjusted R-squared:  0.9451 
## F-statistic: 2.99e+05 on 1 and 17377 DF,  p-value: < 2.2e-16

Both casual and registered show a strong positive effect and explain a large part of the variability. This is no surprise: by definition cnt is the sum of casual and registered, so including them as predictors would mean predicting the response from its own components. Therefore casual and registered are not considered in the model.
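That cnt is the sum of the two user counts can be verified directly; here on the first six rows shown in head(d.bike) above:

```r
# values taken from head(d.bike) above
casual     <- c(3, 8, 5, 3, 0, 0)
registered <- c(13, 32, 27, 10, 1, 1)
cnt        <- c(16, 40, 32, 13, 1, 1)
all(casual + registered == cnt)   # TRUE
```

On the full data, all(d.bike$casual + d.bike$registered == d.bike$cnt) performs the same check.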

Model development

In this section, the insights gained from the exploratory data analysis and the consideration of the individual attributes should help to form a hypothesis about what the final model might look like. This hypothesis is the starting point for developing the model.

The explorative data analysis has shown that the attributes season, yr, mnth, hr, weathersit, temp, atemp and hum can probably be considered in the model. From this, the following model can be assumed:

cnt = season + yr + mnth + hr + weathersit + temp + atemp + hum + season:temp + season:atemp + season:hum + yr:hum + mnth:temp + mnth:atemp + mnth:hum + hr:temp + hr:atemp + hr:hum + weathersit:temp + weathersit:atemp + weathersit:hum 

First, five models are defined: the full model with all predictors; the starting model with polynomial terms and all interactions; the starting model with polynomial terms but without interactions; the starting model with interactions but without polynomial terms; and the starting model without polynomial or interaction terms. Note that R's poly() truncates non-integer degrees, so degree = 8.7 and degree = 8.9 effectively act as degree 8, as the eight polynomial terms per variable in the summary below confirm.

full.model.1 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(holiday) + as.factor(weekday) + as.factor(workingday) + as.factor(weathersit) + temp + atemp + hum + windspeed + casual + registered

starting.model.1 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) + poly(temp, degree = 8) + poly(atemp, degree = 8.9) + season:temp + season:atemp + season:hum + yr:hum + mnth:temp + mnth:atemp + mnth:hum + hr:temp + hr:atemp + hr:hum + weathersit:temp + weathersit:atemp + weathersit:hum

starting.model.2 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) + poly(temp, degree = 8) + poly(atemp, degree = 8.9)

starting.model.3 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + temp + atemp + hum + season:temp + season:atemp + season:hum + yr:hum + mnth:temp + mnth:atemp + mnth:hum + hr:temp + hr:atemp + hr:hum + weathersit:temp + weathersit:atemp + weathersit:hum

starting.model.4 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + temp + atemp + hum

To find out which of the four starting models fits the data best, we compare their adjusted R-squared values:

lm.starting.model.1 <- lm(starting.model.1, data = d.bike)
lm.starting.model.2 <- lm(starting.model.2, data = d.bike)
lm.starting.model.3 <- lm(starting.model.3, data = d.bike)
lm.starting.model.4 <- lm(starting.model.4, data = d.bike)
summary(lm.starting.model.1)$adj.r.squared
## [1] 0.7467235
summary(lm.starting.model.2)$adj.r.squared
## [1] 0.6943448
summary(lm.starting.model.3)$adj.r.squared
## [1] 0.7414885
summary(lm.starting.model.4)$adj.r.squared
## [1] 0.6837583

If only the adjusted R-squared is considered, starting.model.1, with a value of about 0.747, explains the variability of the data better than the other starting models. We therefore take starting.model.1 as the starting point for model development. Note, however, that adjusted R-squared is computed on the data the model was fitted to; a cross-validated error estimate would guard better against overfitting.
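A k-fold cross-validated error estimate can be obtained with cv.glm from boot (loaded above). The sketch below uses synthetic data, since the principle is independent of the bike data (the toy model and seed are illustrative):

```r
library(boot)

set.seed(123)
n <- 500
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)                     # noise sd = 1, so test MSE ~ 1
d <- data.frame(x, y)
fit <- glm(y ~ x, data = d)                   # gaussian glm == lm; cv.glm expects a glm
cv.err <- cv.glm(d, fit, K = 10)$delta[1]     # 10-fold CV estimate of the MSE
cv.err
```

Applied to the starting models (refitted with glm), the formula with the lowest CV error would be preferred.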

summary(lm.starting.model.1)
## 
## Call:
## lm(formula = starting.model.1, data = d.bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -409.85  -50.61   -3.05   41.55  431.06 
## 
## Coefficients: (4 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -1.504e+02  2.732e+01  -5.505 3.74e-08 ***
## as.factor(season)2          4.552e+01  2.008e+01   2.267 0.023428 *  
## as.factor(season)3          1.696e+02  4.735e+01   3.581 0.000343 ***
## as.factor(season)4          1.593e+02  2.250e+01   7.080 1.50e-12 ***
## as.factor(yr)1              1.729e+02  5.050e+00  34.232  < 2e-16 ***
## as.factor(mnth)2            1.749e+01  1.574e+01   1.111 0.266646    
## as.factor(mnth)3           -2.784e+01  1.830e+01  -1.521 0.128260    
## as.factor(mnth)4           -1.754e+00  3.004e+01  -0.058 0.953434    
## as.factor(mnth)5            4.633e+01  3.951e+01   1.173 0.240950    
## as.factor(mnth)6            5.654e+01  4.732e+01   1.195 0.232197    
## as.factor(mnth)7           -1.583e+01  7.914e+01  -0.200 0.841505    
## as.factor(mnth)8           -9.061e+01  6.983e+01  -1.298 0.194425    
## as.factor(mnth)9           -3.648e+01  5.214e+01  -0.700 0.484183    
## as.factor(mnth)10          -5.008e+01  3.398e+01  -1.474 0.140532    
## as.factor(mnth)11          -1.219e+02  2.957e+01  -4.123 3.76e-05 ***
## as.factor(mnth)12          -8.407e+01  2.251e+01  -3.735 0.000188 ***
## as.factor(hr)1             -5.915e+00  2.355e+01  -0.251 0.801717    
## as.factor(hr)2             -2.188e+01  2.367e+01  -0.924 0.355258    
## as.factor(hr)3             -3.934e+01  2.450e+01  -1.606 0.108363    
## as.factor(hr)4             -5.036e+01  2.494e+01  -2.019 0.043481 *  
## as.factor(hr)5             -3.047e+01  2.457e+01  -1.240 0.214877    
## as.factor(hr)6             -8.563e+00  2.427e+01  -0.353 0.724207    
## as.factor(hr)7              5.986e+01  2.434e+01   2.459 0.013928 *  
## as.factor(hr)8              2.063e+02  2.401e+01   8.593  < 2e-16 ***
## as.factor(hr)9              1.239e+02  2.338e+01   5.297 1.19e-07 ***
## as.factor(hr)10             8.564e+01  2.309e+01   3.708 0.000209 ***
## as.factor(hr)11             1.038e+02  2.287e+01   4.539 5.69e-06 ***
## as.factor(hr)12             1.458e+02  2.285e+01   6.379 1.83e-10 ***
## as.factor(hr)13             1.507e+02  2.295e+01   6.570 5.19e-11 ***
## as.factor(hr)14             1.421e+02  2.304e+01   6.170 6.98e-10 ***
## as.factor(hr)15             1.515e+02  2.304e+01   6.574 5.02e-11 ***
## as.factor(hr)16             1.511e+02  2.295e+01   6.584 4.72e-11 ***
## as.factor(hr)17             1.849e+02  2.281e+01   8.104 5.68e-16 ***
## as.factor(hr)18             1.380e+02  2.259e+01   6.107 1.03e-09 ***
## as.factor(hr)19             8.509e+01  2.265e+01   3.757 0.000173 ***
## as.factor(hr)20             4.316e+01  2.273e+01   1.898 0.057657 .  
## as.factor(hr)21             2.795e+01  2.291e+01   1.220 0.222417    
## as.factor(hr)22             1.631e+01  2.304e+01   0.708 0.479119    
## as.factor(hr)23             1.517e+00  2.333e+01   0.065 0.948171    
## as.factor(weathersit)2     -1.883e+01  8.513e+00  -2.212 0.027012 *  
## as.factor(weathersit)3      4.452e+00  2.243e+01   0.199 0.842651    
## as.factor(weathersit)4     -9.061e+01  2.980e+02  -0.304 0.761043    
## poly(hum, degree = 8.7)1    2.763e+03  9.508e+02   2.906 0.003669 ** 
## poly(hum, degree = 8.7)2   -1.021e+03  1.348e+02  -7.569 3.94e-14 ***
## poly(hum, degree = 8.7)3   -1.429e+02  1.238e+02  -1.154 0.248464    
## poly(hum, degree = 8.7)4    2.572e+02  1.103e+02   2.332 0.019728 *  
## poly(hum, degree = 8.7)5   -3.879e+01  1.036e+02  -0.375 0.707961    
## poly(hum, degree = 8.7)6   -6.089e+01  9.550e+01  -0.638 0.523777    
## poly(hum, degree = 8.7)7    9.659e+01  9.376e+01   1.030 0.302964    
## poly(hum, degree = 8.7)8   -3.870e+02  9.332e+01  -4.147 3.39e-05 ***
## poly(temp, degree = 8)1     1.181e+04  7.092e+03   1.665 0.095835 .  
## poly(temp, degree = 8)2    -2.931e+03  9.130e+02  -3.211 0.001326 ** 
## poly(temp, degree = 8)3    -2.470e+03  5.619e+02  -4.395 1.11e-05 ***
## poly(temp, degree = 8)4    -8.040e+02  3.172e+02  -2.534 0.011270 *  
## poly(temp, degree = 8)5     1.081e+02  2.761e+02   0.392 0.695426    
## poly(temp, degree = 8)6     4.461e+02  2.137e+02   2.087 0.036884 *  
## poly(temp, degree = 8)7     1.532e+02  1.409e+02   1.088 0.276681    
## poly(temp, degree = 8)8    -5.470e+01  1.285e+02  -0.426 0.670404    
## poly(atemp, degree = 8.9)1 -1.793e+04  6.725e+03  -2.667 0.007666 ** 
## poly(atemp, degree = 8.9)2  4.792e+02  8.660e+02   0.553 0.580003    
## poly(atemp, degree = 8.9)3  3.371e+02  5.276e+02   0.639 0.522839    
## poly(atemp, degree = 8.9)4 -5.636e+01  3.058e+02  -0.184 0.853746    
## poly(atemp, degree = 8.9)5 -1.515e+02  2.802e+02  -0.541 0.588692    
## poly(atemp, degree = 8.9)6 -1.293e+02  2.097e+02  -0.617 0.537502    
## poly(atemp, degree = 8.9)7  1.892e+02  1.787e+02   1.059 0.289715    
## poly(atemp, degree = 8.9)8  7.118e+01  1.529e+02   0.465 0.641603    
## season1:temp               -5.197e+02  2.002e+02  -2.595 0.009459 ** 
## season2:temp               -3.614e+02  2.364e+02  -1.529 0.126299    
## season3:temp                6.286e+00  2.027e+02   0.031 0.975266    
## season4:temp                       NA         NA      NA       NA    
## season1:atemp               6.683e+02  2.149e+02   3.110 0.001876 ** 
## season2:atemp               5.484e+02  2.615e+02   2.097 0.035975 *  
## season3:atemp              -4.823e+01  2.327e+02  -0.207 0.835807    
## season4:atemp                      NA         NA      NA       NA    
## season1:hum                 4.992e+01  2.686e+01   1.859 0.063096 .  
## season2:hum                -5.991e+00  2.999e+01  -0.200 0.841651    
## season3:hum                 6.459e-01  2.966e+01   0.022 0.982628    
## season4:hum                        NA         NA      NA       NA    
## hum:yr1                    -1.431e+02  7.750e+00 -18.465  < 2e-16 ***
## temp:mnth2                 -5.400e+01  1.523e+02  -0.355 0.722967    
## temp:mnth3                  2.341e+02  1.994e+02   1.174 0.240344    
## temp:mnth4                 -2.552e+02  3.177e+02  -0.803 0.421759    
## temp:mnth5                 -1.068e+02  3.318e+02  -0.322 0.747478    
## temp:mnth6                 -1.233e+02  3.134e+02  -0.393 0.694003    
## temp:mnth7                 -2.349e+02  3.488e+02  -0.674 0.500624    
## temp:mnth8                 -2.143e+02  3.024e+02  -0.709 0.478607    
## temp:mnth9                 -3.485e+02  3.116e+02  -1.118 0.263450    
## temp:mnth10                -1.929e+02  2.911e+02  -0.663 0.507577    
## temp:mnth11                -3.472e+02  2.703e+02  -1.285 0.198961    
## temp:mnth12                -1.495e+02  2.021e+02  -0.740 0.459336    
## atemp:mnth2                 5.027e+01  1.554e+02   0.324 0.746235    
## atemp:mnth3                -8.513e+01  2.038e+02  -0.418 0.676162    
## atemp:mnth4                 3.775e+02  3.326e+02   1.135 0.256345    
## atemp:mnth5                 1.891e+02  3.538e+02   0.534 0.593053    
## atemp:mnth6                 2.316e+02  3.315e+02   0.699 0.484778    
## atemp:mnth7                 4.883e+02  3.606e+02   1.354 0.175647    
## atemp:mnth8                 5.821e+02  3.172e+02   1.835 0.066522 .  
## atemp:mnth9                 7.407e+02  3.366e+02   2.201 0.027763 *  
## atemp:mnth10                5.761e+02  3.106e+02   1.855 0.063657 .  
## atemp:mnth11                6.915e+02  2.852e+02   2.425 0.015327 *  
## atemp:mnth12                4.586e+02  2.092e+02   2.192 0.028369 *  
## hum:mnth2                  -1.004e+01  1.943e+01  -0.517 0.605280    
## hum:mnth3                   6.240e+00  2.028e+01   0.308 0.758322    
## hum:mnth4                  -3.714e+01  2.818e+01  -1.318 0.187555    
## hum:mnth5                  -5.447e+01  3.013e+01  -1.808 0.070613 .  
## hum:mnth6                  -9.128e+01  3.084e+01  -2.960 0.003077 ** 
## hum:mnth7                  -1.021e+02  4.481e+01  -2.277 0.022771 *  
## hum:mnth8                  -1.128e+02  4.084e+01  -2.761 0.005770 ** 
## hum:mnth9                  -1.734e+02  3.597e+01  -4.822 1.43e-06 ***
## hum:mnth10                 -1.339e+02  3.426e+01  -3.910 9.28e-05 ***
## hum:mnth11                 -2.618e+01  3.363e+01  -0.779 0.436265    
## hum:mnth12                 -3.117e+01  2.752e+01  -1.133 0.257308    
## temp:hr1                    8.673e-01  1.758e+02   0.005 0.996063    
## temp:hr2                   -9.537e+01  1.743e+02  -0.547 0.584362    
## temp:hr3                   -1.269e+02  1.779e+02  -0.713 0.475714    
## temp:hr4                   -7.483e+01  1.804e+02  -0.415 0.678289    
## temp:hr5                   -8.529e+01  1.796e+02  -0.475 0.634960    
## temp:hr6                   -5.963e+01  1.770e+02  -0.337 0.736253    
## temp:hr7                   -1.690e+02  1.775e+02  -0.952 0.341036    
## temp:hr8                   -6.394e+01  1.771e+02  -0.361 0.718106    
## temp:hr9                   -7.412e+01  1.726e+02  -0.429 0.667673    
## temp:hr10                  -1.457e+02  1.716e+02  -0.849 0.395837    
## temp:hr11                  -1.314e+02  1.687e+02  -0.779 0.435943    
## temp:hr12                  -1.068e+02  1.645e+02  -0.649 0.516091    
## temp:hr13                  -1.331e+02  1.630e+02  -0.817 0.413986    
## temp:hr14                  -1.027e+02  1.625e+02  -0.632 0.527328    
## temp:hr15                   6.105e+01  1.620e+02   0.377 0.706356    
## temp:hr16                   6.380e+01  1.658e+02   0.385 0.700395    
## temp:hr17                   1.520e+02  1.640e+02   0.927 0.354028    
## temp:hr18                   8.601e+01  1.684e+02   0.511 0.609572    
## temp:hr19                   1.728e+02  1.708e+02   1.012 0.311600    
## temp:hr20                   1.763e+02  1.731e+02   1.018 0.308536    
## temp:hr21                   8.595e+01  1.751e+02   0.491 0.623498    
## temp:hr22                   6.471e+01  1.803e+02   0.359 0.719762    
## temp:hr23                   9.546e+00  1.828e+02   0.052 0.958359    
## atemp:hr1                  -5.446e+01  1.956e+02  -0.278 0.780730    
## atemp:hr2                   3.448e+01  1.948e+02   0.177 0.859520    
## atemp:hr3                   5.586e+01  1.993e+02   0.280 0.779228    
## atemp:hr4                  -7.392e+00  2.020e+02  -0.037 0.970808    
## atemp:hr5                   2.561e+01  2.014e+02   0.127 0.898798    
## atemp:hr6                   1.237e+02  1.983e+02   0.623 0.532972    
## atemp:hr7                   4.808e+02  1.983e+02   2.425 0.015330 *  
## atemp:hr8                   4.041e+02  1.967e+02   2.055 0.039926 *  
## atemp:hr9                   2.373e+02  1.907e+02   1.244 0.213467    
## atemp:hr10                  3.401e+02  1.897e+02   1.793 0.073065 .  
## atemp:hr11                  3.819e+02  1.868e+02   2.045 0.040899 *  
## atemp:hr12                  4.099e+02  1.825e+02   2.247 0.024676 *  
## atemp:hr13                  4.393e+02  1.813e+02   2.423 0.015388 *  
## atemp:hr14                  3.955e+02  1.808e+02   2.187 0.028723 *  
## atemp:hr15                  2.138e+02  1.804e+02   1.185 0.235978    
## atemp:hr16                  3.478e+02  1.848e+02   1.882 0.059841 .  
## atemp:hr17                  5.423e+02  1.830e+02   2.963 0.003046 ** 
## atemp:hr18                  6.125e+02  1.872e+02   3.272 0.001070 ** 
## atemp:hr19                  3.677e+02  1.898e+02   1.937 0.052754 .  
## atemp:hr20                  2.202e+02  1.919e+02   1.147 0.251200    
## atemp:hr21                  1.984e+02  1.939e+02   1.023 0.306220    
## atemp:hr22                  1.238e+02  1.998e+02   0.620 0.535565    
## atemp:hr23                  7.664e+01  2.024e+02   0.379 0.705004    
## hum:hr1                     1.399e+01  2.996e+01   0.467 0.640483    
## hum:hr2                     2.774e+01  3.032e+01   0.915 0.360159    
## hum:hr3                     4.193e+01  3.143e+01   1.334 0.182186    
## hum:hr4                     5.384e+01  3.177e+01   1.695 0.090179 .  
## hum:hr5                     3.279e+01  3.162e+01   1.037 0.299715    
## hum:hr6                     1.058e+01  3.109e+01   0.340 0.733522    
## hum:hr7                    -3.870e+01  3.059e+01  -1.265 0.205856    
## hum:hr8                    -6.984e+01  3.027e+01  -2.307 0.021044 *  
## hum:hr9                    -4.772e+01  2.956e+01  -1.614 0.106495    
## hum:hr10                   -8.687e+01  2.934e+01  -2.961 0.003071 ** 
## hum:hr11                   -1.163e+02  2.919e+01  -3.986 6.76e-05 ***
## hum:hr12                   -1.659e+02  2.915e+01  -5.691 1.28e-08 ***
## hum:hr13                   -1.858e+02  2.915e+01  -6.374 1.89e-10 ***
## hum:hr14                   -1.861e+02  2.895e+01  -6.429 1.32e-10 ***
## hum:hr15                   -1.756e+02  2.875e+01  -6.108 1.03e-09 ***
## hum:hr16                   -1.984e+02  2.858e+01  -6.943 3.97e-12 ***
## hum:hr17                   -2.557e+02  2.823e+01  -9.061  < 2e-16 ***
## hum:hr18                   -2.174e+02  2.830e+01  -7.683 1.64e-14 ***
## hum:hr19                   -1.691e+02  2.821e+01  -5.995 2.08e-09 ***
## hum:hr20                   -1.085e+02  2.834e+01  -3.830 0.000129 ***
## hum:hr21                   -7.307e+01  2.867e+01  -2.548 0.010832 *  
## hum:hr22                   -4.286e+01  2.882e+01  -1.487 0.137069    
## hum:hr23                   -9.168e+00  2.924e+01  -0.314 0.753864    
## temp:weathersit2            2.035e+00  6.183e+01   0.033 0.973736    
## temp:weathersit3            2.983e+01  1.033e+02   0.289 0.772719    
## temp:weathersit4           -4.949e+02  2.870e+03  -0.172 0.863102    
## atemp:weathersit2          -2.220e+01  6.882e+01  -0.323 0.746973    
## atemp:weathersit3          -8.854e+01  1.149e+02  -0.770 0.441079    
## atemp:weathersit4           8.651e+02  2.579e+03   0.335 0.737282    
## hum:weathersit2             3.037e+01  1.055e+01   2.879 0.003989 ** 
## hum:weathersit3            -3.694e+01  2.308e+01  -1.600 0.109568    
## hum:weathersit4                    NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.29 on 17193 degrees of freedom
## Multiple R-squared:  0.7494, Adjusted R-squared:  0.7467 
## F-statistic: 277.9 on 185 and 17193 DF,  p-value: < 2.2e-16

Considering the p-values of the individual predictors and interaction terms in the output of starting.model.1, it becomes clear that certain attributes have no significant effect on cnt. In a first iteration, the interaction terms weathersit:atemp, weathersit:temp, hr:atemp and mnth:temp are removed from the model. In a second iteration, the interaction terms and predictors with a weak or medium effect are removed: mnth:hum, season:temp and weathersit:hum. The two iterations were performed separately in the analysis (see lines 246-303, group13_bikesharing.R); for reasons of clarity, the predictors and interactions are removed here in one pass.

# 1st iteration
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - weathersit:atemp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - weathersit:temp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - hr:atemp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - mnth:temp)
# 2nd iteration
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - mnth:hum)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - season:temp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - weathersit:hum)
summary(lm.starting.model.1)
## 
## Call:
## lm(formula = cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + 
##     as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) + 
##     poly(temp, degree = 8) + poly(atemp, degree = 8.9) + season:atemp + 
##     season:hum + hum:yr + atemp:mnth + temp:hr + hum:hr, data = d.bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418.83  -51.19   -2.54   41.54  431.19 
## 
## Coefficients: (3 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -143.267     25.714  -5.572 2.56e-08 ***
## as.factor(season)2            73.324     17.205   4.262 2.04e-05 ***
## as.factor(season)3           298.104     38.588   7.725 1.18e-14 ***
## as.factor(season)4           195.091     16.508  11.818  < 2e-16 ***
## as.factor(yr)1               172.268      4.990  34.521  < 2e-16 ***
## as.factor(mnth)2              10.876     10.721   1.014 0.310367    
## as.factor(mnth)3             -20.106     13.838  -1.453 0.146264    
## as.factor(mnth)4              -9.261     23.624  -0.392 0.695051    
## as.factor(mnth)5              21.410     30.430   0.704 0.481711    
## as.factor(mnth)6               7.077     38.071   0.186 0.852537    
## as.factor(mnth)7             -40.325     50.737  -0.795 0.426750    
## as.factor(mnth)8            -113.138     47.319  -2.391 0.016815 *  
## as.factor(mnth)9            -182.423     40.325  -4.524 6.11e-06 ***
## as.factor(mnth)10           -136.993     23.215  -5.901 3.68e-09 ***
## as.factor(mnth)11           -124.801     20.972  -5.951 2.72e-09 ***
## as.factor(mnth)12            -99.478     15.568  -6.390 1.70e-10 ***
## as.factor(hr)1                -7.430     22.747  -0.327 0.743961    
## as.factor(hr)2               -20.040     22.900  -0.875 0.381518    
## as.factor(hr)3               -37.708     23.616  -1.597 0.110357    
## as.factor(hr)4               -51.305     24.012  -2.137 0.032644 *  
## as.factor(hr)5               -32.269     23.650  -1.364 0.172442    
## as.factor(hr)6                -7.857     23.362  -0.336 0.736631    
## as.factor(hr)7                73.630     23.393   3.148 0.001649 ** 
## as.factor(hr)8               215.553     23.377   9.221  < 2e-16 ***
## as.factor(hr)9               130.726     22.817   5.729 1.02e-08 ***
## as.factor(hr)10               94.937     22.567   4.207 2.60e-05 ***
## as.factor(hr)11              112.811     22.286   5.062 4.19e-07 ***
## as.factor(hr)12              154.479     22.217   6.953 3.70e-12 ***
## as.factor(hr)13              159.391     22.255   7.162 8.27e-13 ***
## as.factor(hr)14              149.476     22.303   6.702 2.12e-11 ***
## as.factor(hr)15              154.276     22.290   6.921 4.63e-12 ***
## as.factor(hr)16              157.121     22.198   7.078 1.52e-12 ***
## as.factor(hr)17              198.580     22.063   9.001  < 2e-16 ***
## as.factor(hr)18              153.956     21.929   7.021 2.29e-12 ***
## as.factor(hr)19               97.181     21.871   4.443 8.91e-06 ***
## as.factor(hr)20               50.504     21.879   2.308 0.020994 *  
## as.factor(hr)21               33.505     22.027   1.521 0.128254    
## as.factor(hr)22               19.421     22.067   0.880 0.378822    
## as.factor(hr)23                5.565     22.301   0.250 0.802963    
## as.factor(weathersit)2        -7.823      1.752  -4.464 8.10e-06 ***
## as.factor(weathersit)3       -54.424      3.076 -17.694  < 2e-16 ***
## as.factor(weathersit)4       -23.638     53.297  -0.444 0.657400    
## poly(hum, degree = 8.7)1     957.555    586.675   1.632 0.102661    
## poly(hum, degree = 8.7)2    -914.302    120.837  -7.566 4.03e-14 ***
## poly(hum, degree = 8.7)3    -177.031     98.917  -1.790 0.073521 .  
## poly(hum, degree = 8.7)4     384.352     99.925   3.846 0.000120 ***
## poly(hum, degree = 8.7)5    -112.904     95.333  -1.184 0.236309    
## poly(hum, degree = 8.7)6     -22.383     93.126  -0.240 0.810065    
## poly(hum, degree = 8.7)7      64.497     92.906   0.694 0.487556    
## poly(hum, degree = 8.7)8    -377.214     92.651  -4.071 4.69e-05 ***
## poly(temp, degree = 8)1    -2134.381   1133.604  -1.883 0.059741 .  
## poly(temp, degree = 8)2    -3474.213    464.949  -7.472 8.26e-14 ***
## poly(temp, degree = 8)3    -2777.949    417.254  -6.658 2.87e-11 ***
## poly(temp, degree = 8)4     -457.491    270.972  -1.688 0.091365 .  
## poly(temp, degree = 8)5      213.518    248.039   0.861 0.389347    
## poly(temp, degree = 8)6      328.755    198.040   1.660 0.096925 .  
## poly(temp, degree = 8)7      177.723    136.542   1.302 0.193071    
## poly(temp, degree = 8)8      -55.473    124.113  -0.447 0.654912    
## poly(atemp, degree = 8.9)1 -1740.835   1685.322  -1.033 0.301647    
## poly(atemp, degree = 8.9)2  1224.997    461.324   2.655 0.007929 ** 
## poly(atemp, degree = 8.9)3   728.013    390.406   1.865 0.062233 .  
## poly(atemp, degree = 8.9)4  -371.555    259.581  -1.431 0.152344    
## poly(atemp, degree = 8.9)5  -294.540    246.098  -1.197 0.231384    
## poly(atemp, degree = 8.9)6     8.645    195.419   0.044 0.964715    
## poly(atemp, degree = 8.9)7   229.686    167.824   1.369 0.171138    
## poly(atemp, degree = 8.9)8    59.002    147.736   0.399 0.689622    
## season1:atemp                 91.303     40.433   2.258 0.023949 *  
## season2:atemp                137.184     37.128   3.695 0.000221 ***
## season3:atemp               -136.185     66.492  -2.048 0.040563 *  
## season4:atemp                     NA         NA      NA       NA    
## season1:hum                  132.432     11.467  11.549  < 2e-16 ***
## season2:hum                   30.183     11.509   2.623 0.008733 ** 
## season3:hum                  -56.517     14.683  -3.849 0.000119 ***
## season4:hum                       NA         NA      NA       NA    
## hum:yr1                     -140.467      7.649 -18.363  < 2e-16 ***
## atemp:mnth2                   -1.972     37.042  -0.053 0.957546    
## atemp:mnth3                  137.136     41.381   3.314 0.000922 ***
## atemp:mnth4                   71.317     60.877   1.172 0.241412    
## atemp:mnth5                   46.588     69.212   0.673 0.500884    
## atemp:mnth6                   76.765     78.079   0.983 0.325535    
## atemp:mnth7                  172.625     92.254   1.871 0.061336 .  
## atemp:mnth8                  272.623     88.129   3.093 0.001981 ** 
## atemp:mnth9                  411.659     80.779   5.096 3.50e-07 ***
## atemp:mnth10                 357.439     61.564   5.806 6.51e-09 ***
## atemp:mnth11                 288.204     60.364   4.774 1.82e-06 ***
## atemp:mnth12                 285.572     50.310   5.676 1.40e-08 ***
## temp:hr0                     -75.269     27.563  -2.731 0.006325 ** 
## temp:hr1                    -123.094     27.696  -4.444 8.87e-06 ***
## temp:hr2                    -139.435     28.057  -4.970 6.77e-07 ***
## temp:hr3                    -153.613     28.643  -5.363 8.29e-08 ***
## temp:hr4                    -157.870     28.775  -5.486 4.16e-08 ***
## temp:hr5                    -138.378     28.590  -4.840 1.31e-06 ***
## temp:hr6                     -27.234     28.181  -0.966 0.333858    
## temp:hr7                     179.174     27.123   6.606 4.06e-11 ***
## temp:hr8                     220.824     26.350   8.380  < 2e-16 ***
## temp:hr9                      65.043     26.066   2.495 0.012594 *  
## temp:hr10                     90.520     26.011   3.480 0.000502 ***
## temp:hr11                    145.219     26.249   5.532 3.21e-08 ***
## temp:hr12                    196.763     26.416   7.449 9.87e-14 ***
## temp:hr13                    197.233     26.662   7.398 1.45e-13 ***
## temp:hr14                    190.124     26.853   7.080 1.49e-12 ***
## temp:hr15                    195.200     26.954   7.242 4.61e-13 ***
## temp:hr16                    315.517     26.923  11.719  < 2e-16 ***
## temp:hr17                    569.891     26.684  21.357  < 2e-16 ***
## temp:hr18                    565.865     26.565  21.301  < 2e-16 ***
## temp:hr19                    434.331     26.655  16.295  < 2e-16 ***
## temp:hr20                    303.636     26.738  11.356  < 2e-16 ***
## temp:hr21                    191.067     26.924   7.097 1.33e-12 ***
## temp:hr22                    100.882     27.066   3.727 0.000194 ***
## temp:hr23                         NA         NA      NA       NA    
## hum:hr1                       13.718     30.007   0.457 0.647545    
## hum:hr2                       26.178     30.329   0.863 0.388074    
## hum:hr3                       42.385     31.456   1.347 0.177856    
## hum:hr4                       54.370     31.816   1.709 0.087483 .  
## hum:hr5                       35.891     31.645   1.134 0.256748    
## hum:hr6                       16.070     31.122   0.516 0.605617    
## hum:hr7                      -32.245     30.647  -1.052 0.292745    
## hum:hr8                      -62.430     30.252  -2.064 0.039064 *  
## hum:hr9                      -46.046     29.542  -1.559 0.119095    
## hum:hr10                     -87.280     29.232  -2.986 0.002833 ** 
## hum:hr11                    -116.807     29.030  -4.024 5.75e-05 ***
## hum:hr12                    -166.615     28.952  -5.755 8.82e-09 ***
## hum:hr13                    -185.503     28.951  -6.408 1.52e-10 ***
## hum:hr14                    -187.703     28.800  -6.518 7.35e-11 ***
## hum:hr15                    -182.570     28.561  -6.392 1.68e-10 ***
## hum:hr16                    -201.586     28.365  -7.107 1.23e-12 ***
## hum:hr17                    -254.651     28.105  -9.061  < 2e-16 ***
## hum:hr18                    -215.800     28.146  -7.667 1.85e-14 ***
## hum:hr19                    -174.028     28.186  -6.174 6.80e-10 ***
## hum:hr20                    -111.621     28.358  -3.936 8.31e-05 ***
## hum:hr21                     -72.597     28.724  -2.527 0.011500 *  
## hum:hr22                     -41.303     28.889  -1.430 0.152820    
## hum:hr23                      -9.223     29.302  -0.315 0.752947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.56 on 17249 degrees of freedom
## Multiple R-squared:  0.7471, Adjusted R-squared:  0.7452 
## F-statistic:   395 on 129 and 17249 DF,  p-value: < 2.2e-16

All interactions with a weak or medium effect are now removed from the model. atemp also shows only a medium effect; since it appears in several interactions with a strong effect, however, atemp is retained. The adjusted R-squared was reduced by these adjustments by only 0.0015 (from 0.7467 to 0.7452), so the final model still explains about 74.5% of the variance in cnt. The final model now looks as follows:

cnt = season + yr + mnth + hr + weathersit + temp + atemp + hum + season:atemp + season:hum + yr:hum + mnth:atemp + hr:temp + hr:hum

The predictors temp, atemp and hum show polynomial effects.
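For reference, the final model can be stored as a formula object, reconstructed verbatim from the Call in the summary output above. Note that the fractional degrees 8.7 and 8.9 are effectively truncated by poly(), which is why the output shows eight polynomial terms for each of hum and atemp:

# final model formula, copied from the Call in the summary above
final.model.1 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) +
  as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) +
  poly(temp, degree = 8) + poly(atemp, degree = 8.9) + season:atemp +
  season:hum + hum:yr + atemp:mnth + temp:hr + hum:hr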

Cross Validation

In this section, the result of the section “Model development” is validated. For this purpose, all models that played a role in the development of the final model are compared with each other. The comparison value is the R-squared on held-out data, averaged over 10 repeated random 75/25 train/test splits (a Monte Carlo variant of cross-validation rather than classic 10-fold cross-validation).

lm.starting.model.1 <- lm(starting.model.1, data = d.bike)
lm.starting.model.2 <- lm(starting.model.2, data = d.bike)
lm.starting.model.3 <- lm(starting.model.3, data = d.bike)
lm.starting.model.4 <- lm(starting.model.4, data = d.bike)
lm.full.model.1 <- lm(full.model.1, data = d.bike)
lm.final.model.1 <- lm(final.model.1, data = d.bike)

# result vectors for the test R-squared of the 10 repetitions
# (without these, the values would be overwritten in every iteration and
# mean() would return only the result of the last split)
r.squared.starting.model.1 <- numeric(10)
r.squared.starting.model.2 <- numeric(10)
r.squared.starting.model.3 <- numeric(10)
r.squared.starting.model.4 <- numeric(10)
r.squared.full.model <- numeric(10)
r.squared.final.model <- numeric(10)

for(i in 1:10){
  # random 75/25 train/test split
  d.bike.train.id <- sample(seq_len(nrow(d.bike)), size = floor(0.75*nrow(d.bike)))
  d.bike.train <- d.bike[d.bike.train.id,]
  d.bike.test <- d.bike[-d.bike.train.id,]

  #predict data with starting model 1
  lm.starting.model.1.train <- lm(starting.model.1, data = d.bike.train)
  predicted.starting.model.1.test <- predict(lm.starting.model.1.train,
                                             newdata = d.bike.test)
  r.squared.starting.model.1[i] <- cor(predicted.starting.model.1.test, d.bike.test$cnt)^2
  
  #predict data with starting model 2
  lm.starting.model.2.train <- lm(starting.model.2, data = d.bike.train)
  predicted.starting.model.2.test <- predict(lm.starting.model.2.train,
                                             newdata = d.bike.test)
  r.squared.starting.model.2[i] <- cor(predicted.starting.model.2.test, d.bike.test$cnt)^2
  
  #predict data with starting model 3
  lm.starting.model.3.train <- lm(starting.model.3, data = d.bike.train)
  predicted.starting.model.3.test <- predict(lm.starting.model.3.train,
                                             newdata = d.bike.test)
  r.squared.starting.model.3[i] <- cor(predicted.starting.model.3.test, d.bike.test$cnt)^2
  
  #predict data with starting model 4
  lm.starting.model.4.train <- lm(starting.model.4, data = d.bike.train)
  predicted.starting.model.4.test <- predict(lm.starting.model.4.train,
                                             newdata = d.bike.test)
  r.squared.starting.model.4[i] <- cor(predicted.starting.model.4.test, d.bike.test$cnt)^2
  
  #predict data with full model
  lm.full.model.train <- lm(full.model.1, data = d.bike.train)
  predicted.full.model.test <- predict(lm.full.model.train,
                                       newdata = d.bike.test)
  r.squared.full.model[i] <- cor(predicted.full.model.test, d.bike.test$cnt)^2
  
  #predict data with final model
  lm.final.model.train <- lm(final.model.1, data = d.bike.train)
  predicted.final.model.test <- predict(lm.final.model.train,
                                        newdata = d.bike.test)
  r.squared.final.model[i] <- cor(predicted.final.model.test, d.bike.test$cnt)^2
}

mean(r.squared.starting.model.1)
## [1] 0.7542875
mean(r.squared.starting.model.2)
## [1] 0.7072345
mean(r.squared.starting.model.3)
## [1] 0.7492328
mean(r.squared.starting.model.4)
## [1] 0.6984628
mean(r.squared.full.model)
## [1] 1
mean(r.squared.final.model)
## [1] 0.7535116

As the comparison shows, full.model.1 achieves the highest R-squared. Its value of exactly 1, however, suggests that the full model contains predictors that determine cnt directly rather than predict it. Since final.model.1 is the second best and much less complex than the full model, the result of the model development is validated. final.model.1 is used for the further analysis steps.
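As a cross-check, a genuine 10-fold cross-validation could also be run with the caret package (loaded at the beginning). A sketch, assuming final.model.1 holds the formula of the final model:

# hypothetical cross-check: true 10-fold CV of the final model via caret
cv.ctrl <- trainControl(method = "cv", number = 10)
cv.final <- train(final.model.1, data = d.bike, method = "lm",
                  trControl = cv.ctrl)
cv.final$results$Rsquared  # average R-squared over the 10 folds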

3. Regression Analysis

In this chapter the bike rental count cnt is predicted with a linear, a non-linear and a Poisson regression model. The three models are then compared by 10-fold cross-validation based on R-squared.

Linear Regression

lm.bike.1 <- lm(final.model.1, data = d.bike)
summary(lm.bike.1)
## 
## Call:
## lm(formula = final.model.1, data = d.bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418.83  -51.19   -2.54   41.54  431.19 
## 
## Coefficients: (3 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -143.267     25.714  -5.572 2.56e-08 ***
## as.factor(season)2            73.324     17.205   4.262 2.04e-05 ***
## as.factor(season)3           298.104     38.588   7.725 1.18e-14 ***
## as.factor(season)4           195.091     16.508  11.818  < 2e-16 ***
## as.factor(yr)1               172.268      4.990  34.521  < 2e-16 ***
## as.factor(mnth)2              10.876     10.721   1.014 0.310367    
## as.factor(mnth)3             -20.106     13.838  -1.453 0.146264    
## as.factor(mnth)4              -9.261     23.624  -0.392 0.695051    
## as.factor(mnth)5              21.410     30.430   0.704 0.481711    
## as.factor(mnth)6               7.077     38.071   0.186 0.852537    
## as.factor(mnth)7             -40.325     50.737  -0.795 0.426750    
## as.factor(mnth)8            -113.138     47.319  -2.391 0.016815 *  
## as.factor(mnth)9            -182.423     40.325  -4.524 6.11e-06 ***
## as.factor(mnth)10           -136.993     23.215  -5.901 3.68e-09 ***
## as.factor(mnth)11           -124.801     20.972  -5.951 2.72e-09 ***
## as.factor(mnth)12            -99.478     15.568  -6.390 1.70e-10 ***
## as.factor(hr)1                -7.430     22.747  -0.327 0.743961    
## as.factor(hr)2               -20.040     22.900  -0.875 0.381518    
## as.factor(hr)3               -37.708     23.616  -1.597 0.110357    
## as.factor(hr)4               -51.305     24.012  -2.137 0.032644 *  
## as.factor(hr)5               -32.269     23.650  -1.364 0.172442    
## as.factor(hr)6                -7.857     23.362  -0.336 0.736631    
## as.factor(hr)7                73.630     23.393   3.148 0.001649 ** 
## as.factor(hr)8               215.553     23.377   9.221  < 2e-16 ***
## as.factor(hr)9               130.726     22.817   5.729 1.02e-08 ***
## as.factor(hr)10               94.937     22.567   4.207 2.60e-05 ***
## as.factor(hr)11              112.811     22.286   5.062 4.19e-07 ***
## as.factor(hr)12              154.479     22.217   6.953 3.70e-12 ***
## as.factor(hr)13              159.391     22.255   7.162 8.27e-13 ***
## as.factor(hr)14              149.476     22.303   6.702 2.12e-11 ***
## as.factor(hr)15              154.276     22.290   6.921 4.63e-12 ***
## as.factor(hr)16              157.121     22.198   7.078 1.52e-12 ***
## as.factor(hr)17              198.580     22.063   9.001  < 2e-16 ***
## as.factor(hr)18              153.956     21.929   7.021 2.29e-12 ***
## as.factor(hr)19               97.181     21.871   4.443 8.91e-06 ***
## as.factor(hr)20               50.504     21.879   2.308 0.020994 *  
## as.factor(hr)21               33.505     22.027   1.521 0.128254    
## as.factor(hr)22               19.421     22.067   0.880 0.378822    
## as.factor(hr)23                5.565     22.301   0.250 0.802963    
## as.factor(weathersit)2        -7.823      1.752  -4.464 8.10e-06 ***
## as.factor(weathersit)3       -54.424      3.076 -17.694  < 2e-16 ***
## as.factor(weathersit)4       -23.638     53.297  -0.444 0.657400    
## poly(hum, degree = 8.7)1     957.555    586.675   1.632 0.102661    
## poly(hum, degree = 8.7)2    -914.302    120.837  -7.566 4.03e-14 ***
## poly(hum, degree = 8.7)3    -177.031     98.917  -1.790 0.073521 .  
## poly(hum, degree = 8.7)4     384.352     99.925   3.846 0.000120 ***
## poly(hum, degree = 8.7)5    -112.904     95.333  -1.184 0.236309    
## poly(hum, degree = 8.7)6     -22.383     93.126  -0.240 0.810065    
## poly(hum, degree = 8.7)7      64.497     92.906   0.694 0.487556    
## poly(hum, degree = 8.7)8    -377.214     92.651  -4.071 4.69e-05 ***
## poly(temp, degree = 8)1    -2134.381   1133.604  -1.883 0.059741 .  
## poly(temp, degree = 8)2    -3474.213    464.949  -7.472 8.26e-14 ***
## poly(temp, degree = 8)3    -2777.949    417.254  -6.658 2.87e-11 ***
## poly(temp, degree = 8)4     -457.491    270.972  -1.688 0.091365 .  
## poly(temp, degree = 8)5      213.518    248.039   0.861 0.389347    
## poly(temp, degree = 8)6      328.755    198.040   1.660 0.096925 .  
## poly(temp, degree = 8)7      177.723    136.542   1.302 0.193071    
## poly(temp, degree = 8)8      -55.473    124.113  -0.447 0.654912    
## poly(atemp, degree = 8.9)1 -1740.835   1685.322  -1.033 0.301647    
## poly(atemp, degree = 8.9)2  1224.997    461.324   2.655 0.007929 ** 
## poly(atemp, degree = 8.9)3   728.013    390.406   1.865 0.062233 .  
## poly(atemp, degree = 8.9)4  -371.555    259.581  -1.431 0.152344    
## poly(atemp, degree = 8.9)5  -294.540    246.098  -1.197 0.231384    
## poly(atemp, degree = 8.9)6     8.645    195.419   0.044 0.964715    
## poly(atemp, degree = 8.9)7   229.686    167.824   1.369 0.171138    
## poly(atemp, degree = 8.9)8    59.002    147.736   0.399 0.689622    
## atemp:season1                 91.303     40.433   2.258 0.023949 *  
## atemp:season2                137.184     37.128   3.695 0.000221 ***
## atemp:season3               -136.185     66.492  -2.048 0.040563 *  
## atemp:season4                     NA         NA      NA       NA    
## season1:hum                  132.432     11.467  11.549  < 2e-16 ***
## season2:hum                   30.183     11.509   2.623 0.008733 ** 
## season3:hum                  -56.517     14.683  -3.849 0.000119 ***
## season4:hum                       NA         NA      NA       NA    
## hum:yr1                     -140.467      7.649 -18.363  < 2e-16 ***
## atemp:mnth2                   -1.972     37.042  -0.053 0.957546    
## atemp:mnth3                  137.136     41.381   3.314 0.000922 ***
## atemp:mnth4                   71.317     60.877   1.172 0.241412    
## atemp:mnth5                   46.588     69.212   0.673 0.500884    
## atemp:mnth6                   76.765     78.079   0.983 0.325535    
## atemp:mnth7                  172.625     92.254   1.871 0.061336 .  
## atemp:mnth8                  272.623     88.129   3.093 0.001981 ** 
## atemp:mnth9                  411.659     80.779   5.096 3.50e-07 ***
## atemp:mnth10                 357.439     61.564   5.806 6.51e-09 ***
## atemp:mnth11                 288.204     60.364   4.774 1.82e-06 ***
## atemp:mnth12                 285.572     50.310   5.676 1.40e-08 ***
## temp:hr0                     -75.269     27.563  -2.731 0.006325 ** 
## temp:hr1                    -123.094     27.696  -4.444 8.87e-06 ***
## temp:hr2                    -139.435     28.057  -4.970 6.77e-07 ***
## temp:hr3                    -153.613     28.643  -5.363 8.29e-08 ***
## temp:hr4                    -157.870     28.775  -5.486 4.16e-08 ***
## temp:hr5                    -138.378     28.590  -4.840 1.31e-06 ***
## temp:hr6                     -27.234     28.181  -0.966 0.333858    
## temp:hr7                     179.174     27.123   6.606 4.06e-11 ***
## temp:hr8                     220.824     26.350   8.380  < 2e-16 ***
## temp:hr9                      65.043     26.066   2.495 0.012594 *  
## temp:hr10                     90.520     26.011   3.480 0.000502 ***
## temp:hr11                    145.219     26.249   5.532 3.21e-08 ***
## temp:hr12                    196.763     26.416   7.449 9.87e-14 ***
## temp:hr13                    197.233     26.662   7.398 1.45e-13 ***
## temp:hr14                    190.124     26.853   7.080 1.49e-12 ***
## temp:hr15                    195.200     26.954   7.242 4.61e-13 ***
## temp:hr16                    315.517     26.923  11.719  < 2e-16 ***
## temp:hr17                    569.891     26.684  21.357  < 2e-16 ***
## temp:hr18                    565.865     26.565  21.301  < 2e-16 ***
## temp:hr19                    434.331     26.655  16.295  < 2e-16 ***
## temp:hr20                    303.636     26.738  11.356  < 2e-16 ***
## temp:hr21                    191.067     26.924   7.097 1.33e-12 ***
## temp:hr22                    100.882     27.066   3.727 0.000194 ***
## temp:hr23                         NA         NA      NA       NA    
## hum:hr1                       13.718     30.007   0.457 0.647545    
## hum:hr2                       26.178     30.329   0.863 0.388074    
## hum:hr3                       42.385     31.456   1.347 0.177856    
## hum:hr4                       54.370     31.816   1.709 0.087483 .  
## hum:hr5                       35.891     31.645   1.134 0.256748    
## hum:hr6                       16.070     31.122   0.516 0.605617    
## hum:hr7                      -32.245     30.647  -1.052 0.292745    
## hum:hr8                      -62.430     30.252  -2.064 0.039064 *  
## hum:hr9                      -46.046     29.542  -1.559 0.119095    
## hum:hr10                     -87.280     29.232  -2.986 0.002833 ** 
## hum:hr11                    -116.807     29.030  -4.024 5.75e-05 ***
## hum:hr12                    -166.615     28.952  -5.755 8.82e-09 ***
## hum:hr13                    -185.503     28.951  -6.408 1.52e-10 ***
## hum:hr14                    -187.703     28.800  -6.518 7.35e-11 ***
## hum:hr15                    -182.570     28.561  -6.392 1.68e-10 ***
## hum:hr16                    -201.586     28.365  -7.107 1.23e-12 ***
## hum:hr17                    -254.651     28.105  -9.061  < 2e-16 ***
## hum:hr18                    -215.800     28.146  -7.667 1.85e-14 ***
## hum:hr19                    -174.028     28.186  -6.174 6.80e-10 ***
## hum:hr20                    -111.621     28.358  -3.936 8.31e-05 ***
## hum:hr21                     -72.597     28.724  -2.527 0.011500 *  
## hum:hr22                     -41.303     28.889  -1.430 0.152820    
## hum:hr23                      -9.223     29.302  -0.315 0.752947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.56 on 17249 degrees of freedom
## Multiple R-squared:  0.7471, Adjusted R-squared:  0.7452 
## F-statistic:   395 on 129 and 17249 DF,  p-value: < 2.2e-16

The linear regression model explains about 74.5% of the variability in the data (adjusted R-squared). This value has already been calculated in the section “Model development”.

Non-linear Regression

gam.bike.1 <- gam(cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + s(hum) + s(temp) + s(atemp) + atemp:season + hum:season + hum:yr + atemp:mnth + temp:hr + hum:hr, data = d.bike)
summary(gam.bike.1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + 
##     as.factor(weathersit) + s(hum) + s(temp) + s(atemp) + atemp:season + 
##     hum:season + hum:yr + atemp:mnth + temp:hr + hum:hr
## 
## Parametric coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -43.398     16.243  -2.672 0.007553 ** 
## as.factor(season)2       74.999     17.194   4.362 1.30e-05 ***
## as.factor(season)3      293.571     38.505   7.624 2.58e-14 ***
## as.factor(season)4      195.233     16.481  11.846  < 2e-16 ***
## as.factor(yr)1          171.915      4.985  34.488  < 2e-16 ***
## as.factor(mnth)2         11.712     10.671   1.098 0.272415    
## as.factor(mnth)3        -20.249     13.820  -1.465 0.142874    
## as.factor(mnth)4        -10.974     23.573  -0.466 0.641566    
## as.factor(mnth)5         19.163     30.338   0.632 0.527623    
## as.factor(mnth)6          9.738     37.840   0.257 0.796904    
## as.factor(mnth)7        -46.450     50.499  -0.920 0.357685    
## as.factor(mnth)8       -117.640     47.115  -2.497 0.012539 *  
## as.factor(mnth)9       -183.342     40.241  -4.556 5.25e-06 ***
## as.factor(mnth)10      -139.991     23.235  -6.025 1.72e-09 ***
## as.factor(mnth)11      -123.205     20.955  -5.880 4.19e-09 ***
## as.factor(mnth)12       -97.427     15.571  -6.257 4.02e-10 ***
## as.factor(hr)1           -7.863     22.719  -0.346 0.729288    
## as.factor(hr)2          -20.528     22.883  -0.897 0.369683    
## as.factor(hr)3          -37.976     23.592  -1.610 0.107486    
## as.factor(hr)4          -51.447     23.989  -2.145 0.031995 *  
## as.factor(hr)5          -32.069     23.623  -1.358 0.174623    
## as.factor(hr)6           -8.429     23.330  -0.361 0.717871    
## as.factor(hr)7           73.174     23.363   3.132 0.001739 ** 
## as.factor(hr)8          214.416     23.350   9.183  < 2e-16 ***
## as.factor(hr)9          130.314     22.802   5.715 1.12e-08 ***
## as.factor(hr)10          94.999     22.551   4.213 2.54e-05 ***
## as.factor(hr)11         111.927     22.270   5.026 5.06e-07 ***
## as.factor(hr)12         153.188     22.199   6.901 5.36e-12 ***
## as.factor(hr)13         157.793     22.238   7.096 1.34e-12 ***
## as.factor(hr)14         148.261     22.286   6.653 2.96e-11 ***
## as.factor(hr)15         153.379     22.270   6.887 5.89e-12 ***
## as.factor(hr)16         156.534     22.178   7.058 1.75e-12 ***
## as.factor(hr)17         198.021     22.045   8.982  < 2e-16 ***
## as.factor(hr)18         154.107     21.916   7.032 2.12e-12 ***
## as.factor(hr)19          96.895     21.859   4.433 9.36e-06 ***
## as.factor(hr)20          49.630     21.866   2.270 0.023237 *  
## as.factor(hr)21          33.095     22.013   1.503 0.132740    
## as.factor(hr)22          19.122     22.056   0.867 0.385964    
## as.factor(hr)23           5.411     22.289   0.243 0.808201    
## as.factor(weathersit)2   -7.856      1.751  -4.486 7.32e-06 ***
## as.factor(weathersit)3  -54.632      3.073 -17.780  < 2e-16 ***
## as.factor(weathersit)4  -25.399     53.270  -0.477 0.633514    
## atemp:season1            64.728     27.234   2.377 0.017478 *  
## atemp:season2           106.303     21.106   5.037 4.79e-07 ***
## atemp:season3          -162.859     43.682  -3.728 0.000193 ***
## atemp:season4           -27.225     29.585  -0.920 0.357465    
## season1:hum              98.507      7.087  13.900  < 2e-16 ***
## season2:hum              -3.243      7.487  -0.433 0.664897    
## season3:hum             -83.543     10.436  -8.005 1.27e-15 ***
## season4:hum             -33.359      7.958  -4.192 2.78e-05 ***
## hum:yr1                -140.056      7.643 -18.324  < 2e-16 ***
## atemp:mnth2              -3.675     36.874  -0.100 0.920614    
## atemp:mnth3             136.612     41.313   3.307 0.000946 ***
## atemp:mnth4              74.756     60.744   1.231 0.218461    
## atemp:mnth5              51.090     69.040   0.740 0.459305    
## atemp:mnth6              72.643     77.665   0.935 0.349630    
## atemp:mnth7             181.354     91.866   1.974 0.048384 *  
## atemp:mnth8             279.218     87.749   3.182 0.001465 ** 
## atemp:mnth9             413.787     80.584   5.135 2.85e-07 ***
## atemp:mnth10            362.589     61.593   5.887 4.01e-09 ***
## atemp:mnth11            282.847     60.284   4.692 2.73e-06 ***
## atemp:mnth12            278.221     50.276   5.534 3.18e-08 ***
## temp:hr0               -207.137     19.211 -10.782  < 2e-16 ***
## temp:hr1               -254.734     19.530 -13.043  < 2e-16 ***
## temp:hr2               -271.327     20.035 -13.543  < 2e-16 ***
## temp:hr3               -286.177     20.814 -13.749  < 2e-16 ***
## temp:hr4               -290.955     20.987 -13.864  < 2e-16 ***
## temp:hr5               -272.603     20.757 -13.133  < 2e-16 ***
## temp:hr6               -161.400     20.234  -7.977 1.60e-15 ***
## temp:hr7                 46.172     18.804   2.455 0.014079 *  
## temp:hr8                 89.153     17.707   5.035 4.83e-07 ***
## temp:hr9                -66.912     17.214  -3.887 0.000102 ***
## temp:hr10               -42.833     16.954  -2.526 0.011532 *  
## temp:hr11                11.794     17.131   0.688 0.491165    
## temp:hr12                63.955     17.239   3.710 0.000208 ***
## temp:hr13                64.724     17.506   3.697 0.000219 ***
## temp:hr14                56.961     17.709   3.217 0.001300 ** 
## temp:hr15                61.397     17.820   3.445 0.000572 ***
## temp:hr16               181.538     17.811  10.193  < 2e-16 ***
## temp:hr17               436.159     17.556  24.844  < 2e-16 ***
## temp:hr18               431.602     17.491  24.675  < 2e-16 ***
## temp:hr19               301.448     17.769  16.965  < 2e-16 ***
## temp:hr20               171.746     18.000   9.541  < 2e-16 ***
## temp:hr21                59.350     18.371   3.231 0.001237 ** 
## temp:hr22               -30.904     18.631  -1.659 0.097192 .  
## temp:hr23              -131.624     18.984  -6.933 4.26e-12 ***
## hum:hr1                  14.003     29.987   0.467 0.640530    
## hum:hr2                  26.977     30.311   0.890 0.373486    
## hum:hr3                  43.368     31.434   1.380 0.167713    
## hum:hr4                  55.351     31.795   1.741 0.081728 .  
## hum:hr5                  37.101     31.620   1.173 0.240682    
## hum:hr6                  18.252     31.092   0.587 0.557190    
## hum:hr7                 -30.967     30.621  -1.011 0.311888    
## hum:hr8                 -60.825     30.227  -2.012 0.044204 *  
## hum:hr9                 -45.273     29.524  -1.533 0.125190    
## hum:hr10                -86.223     29.215  -2.951 0.003168 ** 
## hum:hr11               -114.445     29.011  -3.945 8.02e-05 ***
## hum:hr12               -164.486     28.933  -5.685 1.33e-08 ***
## hum:hr13               -182.998     28.932  -6.325 2.59e-10 ***
## hum:hr14               -185.077     28.781  -6.431 1.30e-10 ***
## hum:hr15               -179.969     28.544  -6.305 2.95e-10 ***
## hum:hr16               -199.170     28.344  -7.027 2.19e-12 ***
## hum:hr17               -252.483     28.083  -8.991  < 2e-16 ***
## hum:hr18               -214.538     28.126  -7.628 2.51e-14 ***
## hum:hr19               -172.730     28.166  -6.132 8.84e-10 ***
## hum:hr20               -110.503     28.340  -3.899 9.69e-05 ***
## hum:hr21                -72.256     28.706  -2.517 0.011843 *  
## hum:hr22                -40.665     28.876  -1.408 0.159067    
## hum:hr23                 -9.072     29.285  -0.310 0.756722    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df      F p-value    
## s(hum)   8.801  8.973 10.864  <2e-16 ***
## s(temp)  6.954  7.939 16.889  <2e-16 ***
## s(atemp) 5.788  6.908  1.734  0.0933 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 133/136
## R-sq.(adj) =  0.745   Deviance explained = 74.7%
## GCV = 8437.1  Scale est. = 8375.2    n = 17379

The non-linear regression model gives practically the same result as the linear regression model: it also explains about 74.5% of the variability in the data.
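In mgcv, the shape of each fitted s() smooth can be inspected directly with plot(). The following is a minimal, self-contained sketch on synthetic data (all variable names are hypothetical, not from the bike data set), analogous to the smooth terms above:

```r
library(mgcv)

# Synthetic example: a clearly non-linear relationship
set.seed(7)
d <- data.frame(x = runif(300))
d$y <- sin(2 * pi * d$x) + rnorm(300, sd = 0.2)

fit <- gam(y ~ s(x), data = d)
summary(fit)$r.sq       # adjusted R-squared, as reported for gam.bike.1
plot(fit, pages = 1)    # visualize the estimated smooth
```

The edf values reported under “Approximate significance of smooth terms” indicate how wiggly each estimated smooth is; plotting them is the quickest sanity check of that flexibility.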

Poisson Regression

# Note: glm() defaults to family = gaussian when no family argument is
# given, so this call fits the same model as lm() above; a true Poisson
# regression would require family = poisson (see the dispersion line in
# the output below).
poi.bike.1 <- glm(final.model.1, data = d.bike)
summary(poi.bike.1)
## 
## Call:
## glm(formula = final.model.1, data = d.bike)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -418.83   -51.19    -2.54    41.54   431.19  
## 
## Coefficients: (3 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -143.267     25.714  -5.572 2.56e-08 ***
## as.factor(season)2            73.324     17.205   4.262 2.04e-05 ***
## as.factor(season)3           298.104     38.588   7.725 1.18e-14 ***
## as.factor(season)4           195.091     16.508  11.818  < 2e-16 ***
## as.factor(yr)1               172.268      4.990  34.521  < 2e-16 ***
## as.factor(mnth)2              10.876     10.721   1.014 0.310367    
## as.factor(mnth)3             -20.106     13.838  -1.453 0.146264    
## as.factor(mnth)4              -9.261     23.624  -0.392 0.695051    
## as.factor(mnth)5              21.410     30.430   0.704 0.481711    
## as.factor(mnth)6               7.077     38.071   0.186 0.852537    
## as.factor(mnth)7             -40.325     50.737  -0.795 0.426750    
## as.factor(mnth)8            -113.138     47.319  -2.391 0.016815 *  
## as.factor(mnth)9            -182.423     40.325  -4.524 6.11e-06 ***
## as.factor(mnth)10           -136.993     23.215  -5.901 3.68e-09 ***
## as.factor(mnth)11           -124.801     20.972  -5.951 2.72e-09 ***
## as.factor(mnth)12            -99.478     15.568  -6.390 1.70e-10 ***
## as.factor(hr)1                -7.430     22.747  -0.327 0.743961    
## as.factor(hr)2               -20.040     22.900  -0.875 0.381518    
## as.factor(hr)3               -37.708     23.616  -1.597 0.110357    
## as.factor(hr)4               -51.305     24.012  -2.137 0.032644 *  
## as.factor(hr)5               -32.269     23.650  -1.364 0.172442    
## as.factor(hr)6                -7.857     23.362  -0.336 0.736631    
## as.factor(hr)7                73.630     23.393   3.148 0.001649 ** 
## as.factor(hr)8               215.553     23.377   9.221  < 2e-16 ***
## as.factor(hr)9               130.726     22.817   5.729 1.02e-08 ***
## as.factor(hr)10               94.937     22.567   4.207 2.60e-05 ***
## as.factor(hr)11              112.811     22.286   5.062 4.19e-07 ***
## as.factor(hr)12              154.479     22.217   6.953 3.70e-12 ***
## as.factor(hr)13              159.391     22.255   7.162 8.27e-13 ***
## as.factor(hr)14              149.476     22.303   6.702 2.12e-11 ***
## as.factor(hr)15              154.276     22.290   6.921 4.63e-12 ***
## as.factor(hr)16              157.121     22.198   7.078 1.52e-12 ***
## as.factor(hr)17              198.580     22.063   9.001  < 2e-16 ***
## as.factor(hr)18              153.956     21.929   7.021 2.29e-12 ***
## as.factor(hr)19               97.181     21.871   4.443 8.91e-06 ***
## as.factor(hr)20               50.504     21.879   2.308 0.020994 *  
## as.factor(hr)21               33.505     22.027   1.521 0.128254    
## as.factor(hr)22               19.421     22.067   0.880 0.378822    
## as.factor(hr)23                5.565     22.301   0.250 0.802963    
## as.factor(weathersit)2        -7.823      1.752  -4.464 8.10e-06 ***
## as.factor(weathersit)3       -54.424      3.076 -17.694  < 2e-16 ***
## as.factor(weathersit)4       -23.638     53.297  -0.444 0.657400    
## poly(hum, degree = 8.7)1     957.555    586.675   1.632 0.102661    
## poly(hum, degree = 8.7)2    -914.302    120.837  -7.566 4.03e-14 ***
## poly(hum, degree = 8.7)3    -177.031     98.917  -1.790 0.073521 .  
## poly(hum, degree = 8.7)4     384.352     99.925   3.846 0.000120 ***
## poly(hum, degree = 8.7)5    -112.904     95.333  -1.184 0.236309    
## poly(hum, degree = 8.7)6     -22.383     93.126  -0.240 0.810065    
## poly(hum, degree = 8.7)7      64.497     92.906   0.694 0.487556    
## poly(hum, degree = 8.7)8    -377.214     92.651  -4.071 4.69e-05 ***
## poly(temp, degree = 8)1    -2134.381   1133.604  -1.883 0.059741 .  
## poly(temp, degree = 8)2    -3474.213    464.949  -7.472 8.26e-14 ***
## poly(temp, degree = 8)3    -2777.949    417.254  -6.658 2.87e-11 ***
## poly(temp, degree = 8)4     -457.491    270.972  -1.688 0.091365 .  
## poly(temp, degree = 8)5      213.518    248.039   0.861 0.389347    
## poly(temp, degree = 8)6      328.755    198.040   1.660 0.096925 .  
## poly(temp, degree = 8)7      177.723    136.542   1.302 0.193071    
## poly(temp, degree = 8)8      -55.473    124.113  -0.447 0.654912    
## poly(atemp, degree = 8.9)1 -1740.835   1685.322  -1.033 0.301647    
## poly(atemp, degree = 8.9)2  1224.997    461.324   2.655 0.007929 ** 
## poly(atemp, degree = 8.9)3   728.013    390.406   1.865 0.062233 .  
## poly(atemp, degree = 8.9)4  -371.555    259.581  -1.431 0.152344    
## poly(atemp, degree = 8.9)5  -294.540    246.098  -1.197 0.231384    
## poly(atemp, degree = 8.9)6     8.645    195.419   0.044 0.964715    
## poly(atemp, degree = 8.9)7   229.686    167.824   1.369 0.171138    
## poly(atemp, degree = 8.9)8    59.002    147.736   0.399 0.689622    
## atemp:season1                 91.303     40.433   2.258 0.023949 *  
## atemp:season2                137.184     37.128   3.695 0.000221 ***
## atemp:season3               -136.185     66.492  -2.048 0.040563 *  
## atemp:season4                     NA         NA      NA       NA    
## season1:hum                  132.432     11.467  11.549  < 2e-16 ***
## season2:hum                   30.183     11.509   2.623 0.008733 ** 
## season3:hum                  -56.517     14.683  -3.849 0.000119 ***
## season4:hum                       NA         NA      NA       NA    
## hum:yr1                     -140.467      7.649 -18.363  < 2e-16 ***
## atemp:mnth2                   -1.972     37.042  -0.053 0.957546    
## atemp:mnth3                  137.136     41.381   3.314 0.000922 ***
## atemp:mnth4                   71.317     60.877   1.172 0.241412    
## atemp:mnth5                   46.588     69.212   0.673 0.500884    
## atemp:mnth6                   76.765     78.079   0.983 0.325535    
## atemp:mnth7                  172.625     92.254   1.871 0.061336 .  
## atemp:mnth8                  272.623     88.129   3.093 0.001981 ** 
## atemp:mnth9                  411.659     80.779   5.096 3.50e-07 ***
## atemp:mnth10                 357.439     61.564   5.806 6.51e-09 ***
## atemp:mnth11                 288.204     60.364   4.774 1.82e-06 ***
## atemp:mnth12                 285.572     50.310   5.676 1.40e-08 ***
## temp:hr0                     -75.269     27.563  -2.731 0.006325 ** 
## temp:hr1                    -123.094     27.696  -4.444 8.87e-06 ***
## temp:hr2                    -139.435     28.057  -4.970 6.77e-07 ***
## temp:hr3                    -153.613     28.643  -5.363 8.29e-08 ***
## temp:hr4                    -157.870     28.775  -5.486 4.16e-08 ***
## temp:hr5                    -138.378     28.590  -4.840 1.31e-06 ***
## temp:hr6                     -27.234     28.181  -0.966 0.333858    
## temp:hr7                     179.174     27.123   6.606 4.06e-11 ***
## temp:hr8                     220.824     26.350   8.380  < 2e-16 ***
## temp:hr9                      65.043     26.066   2.495 0.012594 *  
## temp:hr10                     90.520     26.011   3.480 0.000502 ***
## temp:hr11                    145.219     26.249   5.532 3.21e-08 ***
## temp:hr12                    196.763     26.416   7.449 9.87e-14 ***
## temp:hr13                    197.233     26.662   7.398 1.45e-13 ***
## temp:hr14                    190.124     26.853   7.080 1.49e-12 ***
## temp:hr15                    195.200     26.954   7.242 4.61e-13 ***
## temp:hr16                    315.517     26.923  11.719  < 2e-16 ***
## temp:hr17                    569.891     26.684  21.357  < 2e-16 ***
## temp:hr18                    565.865     26.565  21.301  < 2e-16 ***
## temp:hr19                    434.331     26.655  16.295  < 2e-16 ***
## temp:hr20                    303.636     26.738  11.356  < 2e-16 ***
## temp:hr21                    191.067     26.924   7.097 1.33e-12 ***
## temp:hr22                    100.882     27.066   3.727 0.000194 ***
## temp:hr23                         NA         NA      NA       NA    
## hum:hr1                       13.718     30.007   0.457 0.647545    
## hum:hr2                       26.178     30.329   0.863 0.388074    
## hum:hr3                       42.385     31.456   1.347 0.177856    
## hum:hr4                       54.370     31.816   1.709 0.087483 .  
## hum:hr5                       35.891     31.645   1.134 0.256748    
## hum:hr6                       16.070     31.122   0.516 0.605617    
## hum:hr7                      -32.245     30.647  -1.052 0.292745    
## hum:hr8                      -62.430     30.252  -2.064 0.039064 *  
## hum:hr9                      -46.046     29.542  -1.559 0.119095    
## hum:hr10                     -87.280     29.232  -2.986 0.002833 ** 
## hum:hr11                    -116.807     29.030  -4.024 5.75e-05 ***
## hum:hr12                    -166.615     28.952  -5.755 8.82e-09 ***
## hum:hr13                    -185.503     28.951  -6.408 1.52e-10 ***
## hum:hr14                    -187.703     28.800  -6.518 7.35e-11 ***
## hum:hr15                    -182.570     28.561  -6.392 1.68e-10 ***
## hum:hr16                    -201.586     28.365  -7.107 1.23e-12 ***
## hum:hr17                    -254.651     28.105  -9.061  < 2e-16 ***
## hum:hr18                    -215.800     28.146  -7.667 1.85e-14 ***
## hum:hr19                    -174.028     28.186  -6.174 6.80e-10 ***
## hum:hr20                    -111.621     28.358  -3.936 8.31e-05 ***
## hum:hr21                     -72.597     28.724  -2.527 0.011500 *  
## hum:hr22                     -41.303     28.889  -1.430 0.152820    
## hum:hr23                      -9.223     29.302  -0.315 0.752947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8383.607)
## 
##     Null deviance: 571761591  on 17378  degrees of freedom
## Residual deviance: 144608842  on 17249  degrees of freedom
## AIC: 206453
## 
## Number of Fisher Scoring iterations: 2
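The dispersion line in the output confirms that glm() fell back to the gaussian family because no family argument was given. A minimal sketch on synthetic count data (all names hypothetical) shows how an actual Poisson regression would be specified and how the two families can be compared:

```r
# Synthetic count data with a log-linear mean structure
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + 2 * x))

fit.gauss <- glm(y ~ x)                    # default: family = gaussian
fit.pois  <- glm(y ~ x, family = poisson)  # true Poisson regression (log link)

c(gaussian = AIC(fit.gauss), poisson = AIC(fit.pois))
```

Since cnt is a count, a Poisson (or quasi-Poisson) family would be the natural choice for it as well.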

Comparing the models

To enable a meaningful comparison of the three models, each model is trained 10 times on a freshly sampled training set and then used to predict the corresponding test set. Finally, the average R-squared of the three models is compared to decide which model performs best.

# collect the hold-out R-squared of each repetition in vectors
r.squared.lm.bike.1 <- r.squared.gam.bike.1 <- r.squared.poi.bike.1 <- numeric(10)

for(i in 1:10){
  d.bike.train.id <- sample(seq_len(nrow(d.bike)), size = floor(0.75*nrow(d.bike)))
  d.bike.train <- d.bike[d.bike.train.id,]
  d.bike.test <- d.bike[-d.bike.train.id,]
  
  #predict data with linear model
  lm.bike.1.train <- lm(final.model.1, data = d.bike.train)
  predicted.lm.bike.1.test <- predict(lm.bike.1.train,
                                      newdata = d.bike.test)
  r.squared.lm.bike.1[i] <- cor(predicted.lm.bike.1.test, d.bike.test$cnt)^2
  
  #predict data with non-linear model
  gam.bike.1.train <- gam(cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + s(hum) + s(temp) + s(atemp) + atemp:season + hum:season + hum:yr + atemp:mnth + temp:hr + hum:hr, data = d.bike.train)
  predicted.gam.bike.1.test <- predict(gam.bike.1.train,
                                       newdata = d.bike.test)
  r.squared.gam.bike.1[i] <- cor(predicted.gam.bike.1.test, d.bike.test$cnt)^2
  
  #predict data with poisson model (fit on the training set only)
  poi.bike.1.train <- glm(final.model.1, data = d.bike.train)
  predicted.poi.bike.1.test <- predict(poi.bike.1.train,
                                       newdata = d.bike.test)
  r.squared.poi.bike.1[i] <- cor(predicted.poi.bike.1.test, d.bike.test$cnt)^2
}

mean(r.squared.lm.bike.1)
## [1] 0.7286108
mean(r.squared.gam.bike.1)
## [1] 0.728941
mean(r.squared.poi.bike.1)
## [1] 0.7328381

The results are very close, but the best value is achieved by the Poisson regression model, with an average R-squared of about 0.733.
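The out-of-sample R-squared used in the loop above is simply the squared correlation between predictions and observed values on the test set. A self-contained sketch on synthetic data (names hypothetical) illustrates the computation:

```r
set.seed(42)
d <- data.frame(x = runif(100))
d$y <- 3 * d$x + rnorm(100, sd = 0.1)

# 75/25 train/test split, as in the comparison loop
idx  <- sample(nrow(d), size = floor(0.75 * nrow(d)))
fit  <- lm(y ~ x, data = d[idx, ])
pred <- predict(fit, newdata = d[-idx, ])

cor(pred, d$y[-idx])^2   # squared correlation = hold-out R-squared
```

Because the split is random, this value fluctuates from run to run, which is exactly why the comparison averages over 10 repetitions.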

4. Trees

In this chapter, a classification tree and a regression tree are used to predict cnt, once as a class and once as a numeric value. Afterwards, the regression tree model is used to analyse the bagging, random forest and boosting procedures.

Classification Tree

To be able to predict a class with the classification tree, a class variable must first be created. For this purpose, five classes are formed, indicating whether few (0) or many (4) bicycles were rented in the observed hour. The classes follow a simple principle: the maximum observed value of cnt is divided by 5, yielding five roughly equally wide intervals. Finally, training and test data sets are created from the classified data.

max(d.bike.new$cnt)
## [1] 977
max(d.bike.new$cnt)/5
## [1] 195.4
for(i in 1:nrow(d.bike.new)){
  if(d.bike.new$cnt[i] >= 0 & d.bike.new$cnt[i] < 196){
    d.bike.new$class[i] <- 0
  } else if (d.bike.new$cnt[i] >= 196 & d.bike.new$cnt[i] < 391){
    d.bike.new$class[i] <- 1
  } else if (d.bike.new$cnt[i] >= 391 & d.bike.new$cnt[i] < 586){
    d.bike.new$class[i] <- 2
  } else if (d.bike.new$cnt[i] >= 586 & d.bike.new$cnt[i] < 781){
    d.bike.new$class[i] <- 3
  } else if (d.bike.new$cnt[i] >= 781 & d.bike.new$cnt[i] <= 977){
    d.bike.new$class[i] <- 4
  }
}
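The if/else chain above can be expressed more compactly with cut(), which bins a numeric vector into intervals. A self-contained sketch on a few hypothetical cnt values, reproducing the same 0 to 4 coding:

```r
cnt <- c(0, 150, 196, 390, 391, 586, 780, 781, 977)

# Five left-closed intervals: [0,196), [196,391), [391,586), [586,781), [781,978)
class <- as.integer(cut(cnt,
                        breaks = c(0, 196, 391, 586, 781, 978),
                        right = FALSE)) - 1
class   # 0 0 1 1 2 3 3 4 4
```

cut() also guards against values that fall outside the breaks (they become NA), which a hand-written if/else chain silently skips.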

d.bike.train.id <- sample(seq_len(nrow(d.bike.new)),size = floor(0.75*nrow(d.bike.new)))
d.bike.train.new <- d.bike.new[d.bike.train.id,]
d.bike.test.new <- d.bike.new[-d.bike.train.id,]

On the basis of the five defined classes, a classification tree is now fitted on the training data set. Casting the response variable class to a factor is what makes this a classification tree. cnt is removed from the model because class is derived directly from it.

tree.classification.bike.1 <- tree(as.factor(class) ~.-cnt, data = d.bike.train.new)
summary(tree.classification.bike.1)
## 
## Classification tree:
## tree(formula = as.factor(class) ~ . - cnt, data = d.bike.train.new)
## Variables actually used in tree construction:
## [1] "hr"   "temp" "yr"  
## Number of terminal nodes:  10 
## Residual mean deviance:  1.245 = 16210 / 13020 
## Misclassification error rate: 0.2774 = 3616 / 13034

The summary shows the structure of the classification tree: it has 10 terminal nodes, and only three of the variables were actually used to build it: hr, temp and yr. The tree is now plotted.

plot(tree.classification.bike.1)
text(tree.classification.bike.1, pretty=1, cex=0.75)

As can be seen, the attribute hr forms the root and is therefore the most important attribute. As the summary already indicated, only hr, yr and temp are used for the classification, and the tree has only 10 leaves (terminal nodes).

Now a prediction is made on the training data so that the hit and error rates can be calculated.

tree.classification.bike.pred.train <- predict(tree.classification.bike.1, d.bike.train.new, type="class")
(tree.classification.bike.pred.train.ct <- table(tree.classification.bike.pred.train, as.factor(d.bike.train.new$class)))
##                                    
## tree.classification.bike.pred.train    0    1    2    3    4
##                                   0 7587 1684  227    5    0
##                                   1  279 1340  624  124    0
##                                   2  107  146  314  112    4
##                                   3   38   38   89  177  139
##                                   4    0    0    0    0    0
tree.classification.bike.pred.train.correct <- 0
tree.classification.bike.pred.train.error <- 0
for (i1 in 1:5) {   # loop over all five classes
  for (i2 in 1:5) {
    if (i1 == i2) {
      tree.classification.bike.pred.train.correct <- tree.classification.bike.pred.train.correct + tree.classification.bike.pred.train.ct[i1,i2]
    }else{
      tree.classification.bike.pred.train.error <- tree.classification.bike.pred.train.error + tree.classification.bike.pred.train.ct[i1,i2]
    }
  }
}
(tree.classification.bike.pred.train.rate <- tree.classification.bike.pred.train.correct/sum(tree.classification.bike.pred.train.ct)) 
## [1] 0.7089919
(tree.classification.bike.pred.train.error <- 1 - tree.classification.bike.pred.train.rate) 
## [1] 0.2910081

The table shows that the classification makes many errors; for example, records are assigned to class 0 although they do not belong to it. The hit and error rates confirm this: about 70% of the data objects are correctly classified and about 30% are misclassified. What does the result look like on the test data set?
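The nested loop used above to tally hits and misses is equivalent to summing the diagonal of the confusion matrix. A minimal sketch on a small hypothetical table:

```r
pred  <- c(0, 1, 1, 2, 2)
truth <- c(0, 1, 2, 2, 0)
ct <- table(pred, truth)

accuracy <- sum(diag(ct)) / sum(ct)  # hits lie on the diagonal
accuracy                             # 0.6
```

diag() requires a square table, so this shortcut works whenever predictions and truth share the same set of class labels, as they do here.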

tree.classification.bike.pred.test <- predict(tree.classification.bike.1, d.bike.test.new, type="class")
(tree.classification.bike.pred.test.ct <- table(tree.classification.bike.pred.test, as.factor(d.bike.test.new$class)))
##                                   
## tree.classification.bike.pred.test    0    1    2    3    4
##                                  0 2502  624   66    0    0
##                                  1   80  441  197   36    0
##                                  2   36   54  105   39    3
##                                  3   13   14   34   69   32
##                                  4    0    0    0    0    0
tree.classification.bike.pred.test.correct <- 0
tree.classification.bike.pred.test.error <- 0
for (i1 in 1:5) {   # loop over all five classes
  for (i2 in 1:5) {
    if (i1 == i2) {
      tree.classification.bike.pred.test.correct <- tree.classification.bike.pred.test.correct + tree.classification.bike.pred.test.ct[i1,i2]
    }else{
      tree.classification.bike.pred.test.error <- tree.classification.bike.pred.test.error + tree.classification.bike.pred.test.ct[i1,i2]
    }
  }
}
(tree.classification.bike.pred.test.rate <- tree.classification.bike.pred.test.correct/sum(tree.classification.bike.pred.test.ct)) 
## [1] 0.701496
(tree.classification.bike.pred.test.error <- 1 - tree.classification.bike.pred.test.rate) 
## [1] 0.298504

The result is essentially the same: the test data set is also classified with a hit rate of about 70% and an error rate of about 30%. Pruning might improve this result.

tree.classification.bike.pruning <- cv.tree(tree.classification.bike.1, FUN = prune.misclass)
plot(tree.classification.bike.pruning$size, tree.classification.bike.pruning$dev, type="b")

plot(tree.classification.bike.pruning$k, tree.classification.bike.pruning$dev, type="b")

The graph shows that the deviance flattens out from a size of about 7, so 7 is chosen for the parameter "best". Now the pruning can be performed.

prune.tree.classification.bike <- prune.misclass(tree.classification.bike.1, best=7) 
summary(prune.tree.classification.bike)
## 
## Classification tree:
## snip.tree(tree = tree.classification.bike.1, nodes = c(2L, 6L
## ))
## Variables actually used in tree construction:
## [1] "hr"   "temp" "yr"  
## Number of terminal nodes:  7 
## Residual mean deviance:  1.333 = 17360 / 13030 
## Misclassification error rate: 0.2787 = 3632 / 13034

This procedure pruned the tree by three leaves (from 10 down to 7 terminal nodes). What does the pruned tree look like visually?

plot(prune.tree.classification.bike)
text(prune.tree.classification.bike,pretty=0)

hr is still the most important attribute and therefore the root of the tree. The left branch of the initial tree disappears completely, and the right branches could also be reduced. Next, the prediction performance on the test data set is checked so that a comparison with the initial tree can be made.

prune.tree.classification.bike.pred.test <- predict(prune.tree.classification.bike,  d.bike.test.new, type="class")
(prune.tree.classification.bike.pred.test.ct <- table(prune.tree.classification.bike.pred.test, as.factor(d.bike.test.new$class)))
##                                         
## prune.tree.classification.bike.pred.test    0    1    2    3    4
##                                        0 2536  645  101   21    3
##                                        1   80  441  197   36    0
##                                        2    2   33   70   18    0
##                                        3   13   14   34   69   32
##                                        4    0    0    0    0    0
(prune.tree.classification.bike.pred.test.correct <- sum(prune.tree.classification.bike.pred.test==as.factor(d.bike.test.new$class))/sum(prune.tree.classification.bike.pred.test.ct)) 
## [1] 0.7171461
(prune.tree.classification.bike.pred.test.error <- 1 - prune.tree.classification.bike.pred.test.correct)
## [1] 0.2828539
tree.classification.bike.pred.test.rate
## [1] 0.701496
tree.classification.bike.pred.test.error 
## [1] 0.298504

The table as well as the error and hit rates show that even the pruned tree still makes classification errors. As the lower two values show, the improvement is minimal.

Regression Tree

The count values are not bell-shaped (normally) distributed:

hist(d.bike.train.new$cnt)

In this section the variable "cnt" is predicted. Its distribution of distinct values:

table(d.bike.train.new$cnt) 
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 105 151 168 167 205 179 149 142  93 122 106  88  93  76  64  82  82  62  50  68 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##  72  51  61  58  52  72  55  59  44  45  55  54  45  47  49  48  41  39  43  47 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##  41  44  35  38  38  37  40  30  28  45  35  39  33  34  36  34  35  25  36  31 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##  27  41  32  51  30  28  24  32  42  33  39  37  26  25  40  28  24  43  39  26 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##  25  22  29  40  25  41  34  32  30  46  33  35  33  30  37  32  27  28  30  19 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##  24  33  28  31  30  37  27  41  27  29  24  32  33  40  23  34  28  39  30  26 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##  31  31  41  34  29  32  30  22  36  28  22  18  30  29  37  29  27  28  26  30 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##  34  27  23  23  29  22  32  29  23  29  32  33  38  35  23  26  31  20  32  22 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##  26  29  32  23  35  24  26  29  27  28  29  30  20  29  27  22  33  29  25  26 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##  32  33  20  23  33  19  26  30  27  37  20  23  21  26  25  25  23  24  16  17 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##  25  33  27  27  26  26  28  25  26  21  28  17  21  22  16  19  25  25  26  24 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##  27  24  23  30  21  24  22  25  29  26  12  29  32  18  22  17  26  24  21  19 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##  20  13  23  20  20  13  14  21  15  17  21  12  19  22  14  22  19  14  17  24 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##  16  11  23  21  12  14  20  19  15  24  15  21  14  26  19  25   9  12  13  20 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##  20  20  22  11  26  19  15  13  14  15  17  16  10  15  11  15  18  15  20  24 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##  14  17  15  17  14  14  10  24  10  15  13  14  24  17  16  18  13  10   8  13 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##  15  12  16   6  11  11  13  26   9  14  20  19  10  14  10  11  16  15   6  11 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##  11  14  17   9   8  10  13   5  12   7  11  10  11   9  13  12  11  12  13  15 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##  11  13  13   5  10   6  16   5  10   8  10  10  14  16   8  14  14  11   8  13 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##  11  13  12   8  11   7  12   8   9  10   9   8   3   9   8  10   9  13   7   8 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##  10  15   6  16   8   9   9  10   9  16   7   7  10   9   6  11   7   8   8  12 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##  14   3   8   7   9   5   9  10   2   4   8  12   6   6   8   7   5   4   5   6 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   8   6   6   4   7   9   8   5   8   7   4  13   7   7  12  11   9   6   7   6 
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 
##   5   9   8  10   6  10   7   8   6   9   7   9   7   5   4   6   1   7   9   5 
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   8  11   8   2   3   7   8   5   7   4  10  11   9   8  11   5   9   6  10   6 
## 501 502 503 504 505 506 507 508 509 511 512 513 514 515 516 517 518 519 520 521 
##   5   6   7   7   9   2   6   9   6   4   5   8   5   4   6   6   4   3   5   7 
## 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 
##   4   3   3   6   5   5   5   5   7   8   2   1   2   2   6   4   3   8   5   7 
## 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 
##   4   6   4   2   6   8   1   3   5   6   5   4   3   7   8   3   5   6   7   2 
## 562 563 564 565 566 567 568 569 570 571 572 573 575 576 577 578 579 580 581 582 
##   4   7   4   6   3   5   8   6   6   3   3   5   1   5   2   4   6   2   4   6 
## 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 
##   2   4   7   5   1   2   3   7  10   2   4   5   2   7   2   2   2   1   3   2 
## 603 604 605 606 607 608 609 610 611 612 614 615 616 617 618 619 620 621 622 623 
##   2   3   4   2   4   3   2   4   2   2   6   3   2   4   2   1   2   1   3   3 
## 624 625 626 627 628 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 
##   3   2   6   6   4   3   2   3   4   1   1   5   2   4   4   3   4   4   2   6 
## 647 648 649 650 651 652 653 654 655 656 657 658 659 660 662 663 664 665 666 667 
##   3   2   4   1   4   1   5   7   2   2   2   2   2   1   4   2   2   3   2   2 
## 668 669 670 671 672 673 674 675 676 678 679 680 681 682 683 684 685 686 687 688 
##   6   1   1   4   2   3   2   1   1   3   3   3   4   2   3   1   1   5   2   2 
## 689 690 691 692 693 694 696 697 698 699 700 702 703 704 705 706 707 708 709 710 
##   2   2   5   2   4   3   1   1   2   2   2   2   2   2   3   1   1   1   1   3 
## 711 712 713 714 715 717 719 721 722 723 724 727 729 730 731 732 733 734 736 737 
##   4   2   4   1   3   1   2   2   2   3   3   1   4   2   3   2   1   1   1   2 
## 738 740 741 743 744 745 747 748 749 750 754 755 757 758 759 760 761 766 767 769 
##   1   1   2   3   3   2   1   1   1   4   1   1   3   1   2   2   1   2   1   1 
## 770 771 772 774 775 776 777 779 781 783 784 785 786 788 790 791 792 794 795 797 
##   3   1   2   1   1   1   1   1   2   1   2   2   1   2   2   2   1   2   2   1 
## 800 801 804 805 806 808 809 810 811 812 813 814 817 818 819 820 822 823 824 825 
##   2   4   1   2   1   4   2   2   2   6   2   2   1   2   2   2   3   2   1   1 
## 827 830 832 833 834 835 837 838 839 843 844 845 846 849 851 852 853 854 856 857 
##   3   1   1   1   2   3   2   2   4   2   2   1   2   1   1   1   2   1   1   2 
## 858 862 864 865 868 869 870 872 873 877 878 884 886 888 890 891 892 893 894 897 
##   3   1   1   1   1   1   1   2   1   2   1   2   2   1   1   2   1   1   1   1 
## 898 900 905 917 922 925 938 941 943 948 953 963 967 968 970 976 977 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

If "cnt" were treated as a factor, a classification tree would normally be used. However, tree() supports at most 32 factor levels, and cnt has far more distinct values, so a regression tree is generated instead.

Proof: classification does not work with "cnt" as a factor because at most 32 levels are possible. The following code would therefore fail:

#tree.classification.bike <- tree(as.factor(cnt) ~ .-class, data = d.bike.train.new) 
#tree.classification.bike.pred <- predict(tree.classification.bike, d.bike.train.new, type="class")
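The 32-level restriction of tree() can be demonstrated with a self-contained toy example (simulated data, not from the bike set; the original analysis hit the same limit with as.factor(cnt)):

```r
library(tree)
set.seed(1)
# A factor with 26 levels is within tree()'s 32-level limit and works
d.toy <- data.frame(y = rnorm(500),
                    x = factor(sample(letters[1:26], 500, replace = TRUE)))
fit.ok <- tree(y ~ x, data = d.toy)

# A factor with 40 levels exceeds the limit; tree() refuses it
d.toy$x2 <- factor(sample(1:40, 500, replace = TRUE))
res <- try(tree(y ~ x2, data = d.toy), silent = TRUE)
inherits(res, "try-error")
```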

The regression tree is generated:

tree.regression.bike <- tree(cnt ~ .-class, data = d.bike.train.new) 
plot(tree.regression.bike)
text(tree.regression.bike, pretty=1, cex=0.75)

The tree has 10 terminal nodes, as the summary shows:

summary(tree.regression.bike)
## 
## Regression tree:
## tree(formula = cnt ~ . - class, data = d.bike.train.new)
## Variables actually used in tree construction:
## [1] "hr"   "temp" "yr"  
## Number of terminal nodes:  10 
## Residual mean deviance:  11230 = 146300000 / 13020 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -554.50  -50.47  -17.24    0.00   53.76  514.80

The regression tree is now used for prediction, first on the training data:

tree.regression.bike.train.pred <- predict(tree.regression.bike, d.bike.train.new, type="vector")

The predictions of the regression tree are compared with the true "cnt" values. The plot shows a positive correlation:

plot(tree.regression.bike.train.pred,d.bike.train.new$cnt)
abline (0 ,1) # compare with the function f(x)=x (intercept 0, slope 1)

The error residuals are calculated as [predicted "cnt" values] - [true "cnt" values]:

error <- tree.regression.bike.train.pred-d.bike.train.new$cnt
element_ID <- 1:length(error)

Analysis of the residuals: the majority of the residuals lie within ±200, see graphic:

plot(element_ID,error)
title(main="Analysis of the residuals")
abline(0 ,0, lwd=5, col="skyblue", lty="dotted")
abline(200 ,0, lwd=5, col="red", lty="dotted")
abline(-200 ,0, lwd=5, col="red", lty="dotted")

Histogram of the errors: most errors do not exceed ±200 (as already seen in the residual analysis above), and the distribution appears approximately bell-shaped. See graphic:

hist(error)

Some key figures on the training data: RSS = 146263805, MSE = 11221.71, standard deviation (RMSE) = 105.9326, calculated with the following code:

(RSS <- sum((d.bike.train.new["cnt"]-tree.regression.bike.train.pred)^2))
## [1] 146263805
(MSE <- mean(((d.bike.train.new["cnt"]-tree.regression.bike.train.pred)^2)$cnt))
## [1] 11221.71
(deviation <- sqrt(MSE))
## [1] 105.9326
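The three quantities can be wrapped in a small helper; the function name is illustrative, not from the original script:

```r
# Computes RSS, MSE and RMSE from numeric vectors of actual and predicted values
tree_metrics <- function(actual, pred) {
  rss <- sum((actual - pred)^2)
  mse <- mean((actual - pred)^2)
  c(RSS = rss, MSE = mse, RMSE = sqrt(mse))
}

# Toy check: residuals 1, 0, -2 -> RSS = 5, MSE = 5/3, RMSE ~ 1.29
tree_metrics(c(3, 5, 7), c(2, 5, 9))
```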

A prediction is now made based on the test data:

tree.regression.bike.test.pred <- predict(tree.regression.bike, d.bike.test.new, type="vector")

MSE: an MSE comparison is made between training and test data. Both values are very close to each other, which indicates good generalization of the tree. MSE of training data: 11221.71; MSE of test data: 10638.24.

(MSE.train <- mean(((d.bike.train.new["cnt"]-tree.regression.bike.train.pred)^2)$cnt))
## [1] 11221.71
(MSE.test <- mean(((d.bike.test.new["cnt"]-tree.regression.bike.test.pred)^2)$cnt))
## [1] 10638.24

Graphical comparison of the error residuals between training and test data: as the graph shows, the error distributions of training and test data look the same, which again indicates good performance of the decision tree. See graphic:

errors.2.in <- predict(tree.regression.bike, d.bike.train.new, type="vector")-d.bike.train.new$cnt
element.2.in <- as.integer(names(errors.2.in))
errors.2.in_dataframe <- tibble(element.2.in,errors.2.in,"TRAIN")
colnames(errors.2.in_dataframe) <- c('ID','error','type')
errors.2 <- predict(tree.regression.bike, d.bike.test.new, type="vector")-d.bike.test.new$cnt
element.2 <- as.integer(names(errors.2))
errors.2.out_dataframe <- tibble(element.2,errors.2,"TEST")
colnames(errors.2.out_dataframe) <- c('ID','error','type')

errors.2_dataframe <- bind_rows(errors.2.in_dataframe,errors.2.out_dataframe) 
errors.2_dataframe <- arrange(errors.2_dataframe, ID)

ggplot(data = errors.2_dataframe, mapping = aes(x = ID,y = error, color = type)) + 
  geom_point() + geom_boxplot(alpha = 0.5)

Pruning of the regression tree: the average deviance is influenced by the number of leaves. Up to four leaves the deviance decreases significantly; from then on it becomes stable. See graphic:

tree.regression.bike.pruning = cv.tree(tree.regression.bike, FUN = prune.tree)
plot(tree.regression.bike.pruning)

cross-validation error-rate based on size and k:

In the graphic below one can see again, that from the 4th node on, the curve flattens out:

par(mfrow=c(1,2))
plot(tree.regression.bike.pruning$size, tree.regression.bike.pruning$dev, type="b")

This graphic shows that after a k of about 5.0e+07, the curve flattens out:

plot(tree.regression.bike.pruning$k, tree.regression.bike.pruning$dev, type="b")

par(mfrow=c(1,1))

At this level the number of leaves should be chosen.

Based on the findings above, the tree is pruned once to 4 and once to 5 leaves. With the 5-leaf tree, the goal is to find out whether the expected lower mean deviance yields a better tree. Result: both trees (generated below) are good; neither contains unnecessary decisions. See the graphics of both trees:

Tree with 5 leaves:

tree.regression.bike.pruned <- prune.tree(tree.regression.bike, best = 5)
plot(tree.regression.bike.pruned)
text(tree.regression.bike.pruned, pretty=1, cex=0.75)

Tree with 4 leaves:

tree.regression.bike.pruned2 <- prune.tree(tree.regression.bike, best = 4)
plot(tree.regression.bike.pruned2)
text(tree.regression.bike.pruned2, pretty=1, cex=0.75)

Summary of both trees: the mean deviance of the tree with 5 leaves ("tree.regression.bike.pruned") is 14790; the mean deviance of the tree with 4 leaves ("tree.regression.bike.pruned2") is 16100. For the remainder of the regression-tree analysis (only within this section), the tree with five leaves is preferred, since it has the lower mean deviance. Side note: the summaries also show that after pruning only the variables "hr" and "temp" are used in both trees. Summary of the tree with 5 leaves:

summary(tree.regression.bike.pruned)
## 
## Regression tree:
## snip.tree(tree = tree.regression.bike, nodes = c(15L, 6L, 14L
## ))
## Variables actually used in tree construction:
## [1] "hr"   "temp"
## Number of terminal nodes:  5 
## Residual mean deviance:  14790 = 192700000 / 13030 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -544.70  -64.02  -20.24    0.00   50.76  635.70

summary of tree with 4 leaves:

summary(tree.regression.bike.pruned2)
## 
## Regression tree:
## snip.tree(tree = tree.regression.bike, nodes = c(15L, 6L, 14L, 
## 2L))
## Variables actually used in tree construction:
## [1] "hr"   "temp"
## Number of terminal nodes:  4 
## Residual mean deviance:  16100 = 209800000 / 13030 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -544.70  -68.60  -31.12    0.00   60.37  635.70

Next, prediction tests are run on the pruned tree with five leaves. Result: the MSE is 14785.88 on the training data and 14245.44 on the test data. These two values are very close to each other, which again indicates good generalization.

Prediction using pruned tree based on training data:

tree.regression.bike.pruned.train.pred <- predict(tree.regression.bike.pruned, d.bike.train.new, type="vector")
(MSE.pruned.train <- mean(((d.bike.train.new["cnt"]-tree.regression.bike.pruned.train.pred)^2)$cnt))
## [1] 14785.88

Prediction using pruned tree based on test data:

tree.regression.bike.pruned.test.pred <- predict(tree.regression.bike.pruned, d.bike.test.new, type="vector")
(MSE.pruned.test <- mean(((d.bike.test.new["cnt"]-tree.regression.bike.pruned.test.pred)^2)$cnt))
## [1] 14245.44

Bagging

In this section, bagging is performed on the regression tree model to reduce the variance. Instead of class, cnt is used as the target variable. The number of bootstrap replications is set to 25. With coob = TRUE, the out-of-bag samples are used to estimate the error, so no separate cross-validation is needed.

bag.bike=bagging(cnt~.-class, data=d.bike.train.new, nbagg=25, coob =TRUE)
print(bag.bike)
## 
## Bagging regression trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = cnt ~ . - class, data = d.bike.train.new, 
##     nbagg = 25, coob = TRUE)
## 
## Out-of-bag estimate of root mean squared error:  51.1638

The out-of-bag root mean squared error (RMSE) is 51.1638. The mean squared error (MSE) is therefore about 2618.
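The conversion from the reported RMSE back to an MSE is a simple squaring; a minimal sketch:

```r
# MSE is the square of the RMSE
rmse <- 51.1638
(mse <- rmse^2)  # approx 2617.7
```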

Now prediction is to be performed by means of bagging on the test set. The result is plotted.

yhat.bag = predict(bag.bike,newdata=d.bike.test.new)
plot(yhat.bag, d.bike.test.new$cnt)  # scatter of predicted vs. true counts
abline(0,1)

The graphical representation shows what the MSE has already indicated: the deviation from the estimate (the straight line), and thus the variance, is large.

mean((yhat.bag-d.bike.test.new$cnt)^2)
## [1] 2652.006

The calculation of the MSE from the forecast confirms this a second time.

Random Forest

In this section the random forest method is applied to the same analysis. Here, however, the "class" variable is used as the target instead of "cnt", since randomForest() cannot handle a factor with that many levels (see https://stats.stackexchange.com/questions/37370/random-forest-computing-time-in-r). mtry is set to 6 for this analysis.

The aim of this analysis is to calculate an MSE value again and to investigate the importance of the individual explaining variables.

run the initial function:

rf.bike=randomForest(as.numeric(as.character(class)) ~ .-cnt,data=d.bike.train.new, mtry=6,importance =TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

predict on test set:

yhat.rf = predict(rf.bike ,newdata=d.bike.test.new)

Performance (MSE) result: the MSE is about 0.226.

mean((yhat.rf-d.bike.test.new$class)^2)
## [1] 0.2264341

The two measures "%IncMSE" (increase in MSE) and "IncNodePurity" (increase in node purity) show that "hr" is by far the most important variable, as the table and graphic below illustrate:

importance(rf.bike)
##              %IncMSE IncNodePurity
## season      64.31021      286.7481
## yr         229.76703     1028.8263
## mnth        73.34177      574.5719
## hr         504.17684     4329.8716
## weathersit  71.58506      277.5074
## temp        58.55109     1087.6280
## atemp       49.33936      941.7789
## hum         80.86539      889.7663
varImpPlot (rf.bike)

Boosting

In this section Boosting will be used to improve the model. Since our data set is count data, we choose the Poisson distribution. The number of trees to be generated in boosting is set to 1000.

boost.bike=gbm(cnt~.-class,data=d.bike.train.new,
               distribution="poisson",n.trees=1000, interaction.depth=4)
summary(boost.bike)

##                   var   rel.inf
## hr                 hr 57.990601
## temp             temp 10.750827
## yr                 yr 10.612623
## mnth             mnth  5.998318
## atemp           atemp  5.162263
## hum               hum  4.030658
## season         season  3.391951
## weathersit weathersit  2.062759

By far the most important attribute is hr. As the result shows the attributes yr and temp are also important. This is not surprising, since the tree developed in the section “Classification Tree” was already based on these three attributes.

plot(boost.bike ,i="hr") 

plot(boost.bike ,i="yr")

The graphical representation of the two most important variables hr and yr shows a clear effect on cnt: the count differs across the individual hours as well as between the two years.

A prediction is performed to check performance on the test data record. As in the previous examples, performance is assessed using MSE.

yhat.boost=predict(boost.bike,newdata=d.bike.test.new, n.trees=1000)
mean((yhat.boost -d.bike.test.new$cnt)^2)
## [1] 65722.11

The MSE is very high at about 65722, roughly 25 times as high as in the bagging example (see Bagging). Adjustments to the shrinkage parameter have no great effect (see line 668, group13_bikesharing.R).
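One plausible contributor to the high MSE (an assumption worth checking, not verified in the original script): for distribution = "poisson", predict.gbm returns predictions on the log (link) scale by default, so they are not directly comparable to raw counts unless type = "response" is requested. A self-contained toy sketch:

```r
library(gbm)
set.seed(1)
# Simulated count data with a log-linear relationship
d.toy <- data.frame(x = runif(300))
d.toy$y <- rpois(300, lambda = exp(1 + 2 * d.toy$x))

fit <- gbm(y ~ x, data = d.toy, distribution = "poisson", n.trees = 100)
p.link <- predict(fit, d.toy, n.trees = 100)                     # log scale
p.resp <- predict(fit, d.toy, n.trees = 100, type = "response")  # count scale
all.equal(p.resp, exp(p.link))  # the response scale is exp(link)
```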

5. Support Vector Machines

The fifth chapter deals with the classification of data using Support Vector Machines (SVMs). Two SVMs are trained and tested: in the first case, the data is classified into two classes using an SVM with two predictors; in the second case, the data is classified into five classes using the final model.

2-class linear SVM with casual und atemp

An interesting classification is the division of the casual riders (cyclists without subscription) into those who ride at cool temperatures and those who ride at very warm temperatures. The two attributes casual and atemp are considered. In a plot, a fairly centered separating hyperplane should divide the data into two classes. For this purpose an intercept on the x-axis (atemp) of 0.5525 is chosen by visual inspection, which separates the two types of cyclists.

xi <- 0.5525
ggplot(data = d.bike, mapping = aes(y = casual, x = atemp)) + geom_point() + geom_vline(xintercept = xi, color = "blue")

The scatterplot shows the blue hyperplane.

The next step is to label the data so that the SVM model knows which classes to determine; this is a prerequisite for classification. A simple if-statement defines the classes: if atemp is smaller than the intercept defined visually above, the observation belongs to class 0, i.e. casual cyclists who rent a bicycle even at cooler temperatures. If atemp is greater than or equal to 0.5525, it belongs to class 1, i.e. cyclists who rent a bicycle at warmer temperatures. A for-loop performs the classification for each observation.

for(i in 1:nrow(d.bike)){
  if(d.bike$atemp[i] < 0.5525){
    d.bike$class[i] <- 0
  } else if (d.bike$atemp[i] >= 0.5525){
    d.bike$class[i] <- 1
  }
}
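The same classification can also be written without an explicit loop; a vectorized sketch using ifelse (with toy values for illustration):

```r
# Class 0 below the threshold, class 1 at or above it
xi <- 0.5525
atemp <- c(0.30, 0.5525, 0.80)  # toy atemp values
ifelse(atemp < xi, 0, 1)        # 0 1 1
```

In the original script this would read d.bike$class <- ifelse(d.bike$atemp < xi, 0, 1), replacing the loop with a single vectorized assignment.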

c.bike <- data.frame(x=d.bike, y=as.factor(d.bike$class))
ggplot(data = c.bike, mapping = aes(y = x.casual, x = x.atemp,  color=y)) + geom_point() + geom_vline(xintercept = xi, size = 1, alpha = 0.5)

The figure shows the two classes separated by color.

Now that the classes and thus the target variable are defined, an SVM model can be fitted. As the summary shows, there are 628 support vectors with a cost parameter of 10: 316 in class 0 and 312 in class 1.

svm.bike.1 <- svm(y~x.casual+x.atemp,
                  data = c.bike,
                  kernel = "linear",
                  cost = 10,
                  scale = FALSE)
summary(svm.bike.1)
## 
## Call:
## svm(formula = y ~ x.casual + x.atemp, data = c.bike, kernel = "linear", 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  628
## 
##  ( 316 312 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
plot(svm.bike.1, c.bike, x.casual~x.atemp)

The plot shows the classification and the support vectors.

In order to obtain the best model with optimal cost factor, a vector with different cost values is first defined and the tune-function is used.

cost_range <-
  c(0.01,
    0.1,
    1,
    5,
    10,
    100)

tune.out <- tune(
  svm,
  y ~ x.casual+x.atemp,
  data = c.bike,
  kernel = "linear",
  ranges = list(cost = cost_range)
)
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0 
## 
## - Detailed performance results:
##    cost        error   dispersion
## 1 1e-02 0.0104725354 0.0025566324
## 2 1e-01 0.0002301827 0.0002971646
## 3 1e+00 0.0000000000 0.0000000000
## 4 5e+00 0.0000000000 0.0000000000
## 5 1e+01 0.0000000000 0.0000000000
## 6 1e+02 0.0000000000 0.0000000000

As can be seen, with a cost factor of 1, an error rate of 0 is already possible.

svm.bike.1.best <- tune.out$best.model
summary(svm.bike.1.best)
## 
## Call:
## best.tune(method = svm, train.x = y ~ x.casual + x.atemp, data = c.bike, 
##     ranges = list(cost = cost_range), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  366
## 
##  ( 184 182 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
plot(svm.bike.1.best, c.bike, x.casual~x.atemp)

This is confirmed by the best model: it has an optimal cost factor of 1 and consequently fewer support vectors, namely 366, a little more than half as many as in the first model.

table(predict = predict(svm.bike.1.best, c.bike),
      truth = c.bike$y)
##        truth
## predict     0     1
##       0 11148     0
##       1     0  6231

As the table shows, no errors are made in the prediction and the data is correctly classified.

5-class linear SVM with the developed model

In the second section of this chapter, a classification into five classes is made based on final.model.1, using the same classes that were already generated in the "Classification Tree" section. Due to the multi-dimensionality (more than 3 predictors), the SVM classification cannot be visualized.

To be able to fit an SVM model, the two data sets must be transformed. The class is removed from the x-variables and assigned as a factor to the y-variable, i.e. the target variable.

d.bike.train.svm <- data.frame(x=subset(d.bike.train.new, select=-c(class)), y=as.factor(d.bike.train.new$class))
d.bike.test.svm <- data.frame(x=subset(d.bike.test.new, select=-c(class)), y=as.factor(d.bike.test.new$class))

Now the model can be fitted. The cost factor is deliberately set low so that classification errors occur first, which are then eliminated by tuning. Note that from this point on the model is fitted on the training data set.

svm.bike.2 <- svm(y ~ .,
                  data = d.bike.train.svm,
                  kernel = "linear",
                  cost = 0.01,
                  metric = "RMSE",
                  scale = FALSE,
)
summary(svm.bike.2)
## 
## Call:
## svm(formula = y ~ ., data = d.bike.train.svm, kernel = "linear", 
##     cost = 0.01, metric = "RMSE", , scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  154
## 
##  ( 43 32 59 16 4 )
## 
## 
## Number of Classes:  5 
## 
## Levels: 
##  0 1 2 3 4

Now the classes are predicted once on the training and once on the test data set.

predict.svm.bike.2.train <- predict(svm.bike.2, d.bike.train.svm)
table(predict.svm.bike.2.train, d.bike.train.svm$y)
##                         
## predict.svm.bike.2.train    0    1    2    3    4
##                        0 8011    0    0    0    0
##                        1    0 3208    0    0    0
##                        2    0    0 1254    5    0
##                        3    0    0    0  413    0
##                        4    0    0    0    0  143
predict.svm.bike.2.test <- predict(svm.bike.2, d.bike.test.svm)
table(predict.svm.bike.2.test, d.bike.test.svm$y)
##                        
## predict.svm.bike.2.test    0    1    2    3    4
##                       0 2631    0    0    0    0
##                       1    0 1133    0    0    0
##                       2    0    0  402    2    0
##                       3    0    0    0  142    0
##                       4    0    0    0    0   35

Errors are made on both data sets. These must now be eliminated. To do this, a cost range is initialized as in the first example, and the model is again optimized using the tune function.

cost_range <-
  c(0.01,
    0.1,
    1,
    5,
    10,
    100,
    1000)

tune.out <- tune(
  svm,
  y~.,
  data = d.bike.train.svm,
  kernel = "linear",
  ranges = list(cost = cost_range),
  scale = FALSE
  )
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.0001534919 
## 
## - Detailed performance results:
##    cost        error   dispersion
## 1 1e-02 0.0008439703 0.0009872952
## 2 1e-01 0.0004602404 0.0006467446
## 3 1e+00 0.0001534919 0.0003235894
## 4 5e+00 0.0001534919 0.0003235894
## 5 1e+01 0.0001534919 0.0003235894
## 6 1e+02 0.0001534919 0.0003235894
## 7 1e+03 0.0001534919 0.0003235894

In this example too, a cost value of 1 already seems sufficient to obtain a near-zero error rate.

svm.bike.2.best <- tune.out$best.model
summary(svm.bike.2.best)
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = d.bike.train.svm, 
##     ranges = list(cost = cost_range), kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  58
## 
##  ( 15 13 23 5 2 )
## 
## 
## Number of Classes:  5 
## 
## Levels: 
##  0 1 2 3 4

The optimal model has cost value 1. In this classification "only" 58 support vectors are generated, distributed very unevenly across the classes; the last class 4 contains only two support vectors.

Subsequently, the classes are to be predicted once on the training data set and once on the test data set. Based on the ratio of correct and incorrect classifications, the performance on both data sets will be evaluated.

predict.svm.bike.2.best.train <- predict(svm.bike.2.best, d.bike.train.svm)
table(predict.svm.bike.2.best.train, d.bike.train.svm$y)
##                              
## predict.svm.bike.2.best.train    0    1    2    3    4
##                             0 8011    0    0    0    0
##                             1    0 3208    0    0    0
##                             2    0    0 1254    0    0
##                             3    0    0    0  418    0
##                             4    0    0    0    0  143
corrects=sum(predict.svm.bike.2.best.train==d.bike.train.svm$y)
errors=sum(predict.svm.bike.2.best.train!=d.bike.train.svm$y)
(performance_train=corrects/(corrects+errors))
## [1] 1

A hit rate of 100% is achieved on the training data set, as the table already shows: all observations are assigned to the correct classes.

predict.svm.bike.2.best.test <- predict(svm.bike.2.best, d.bike.test.svm)
table(predict.svm.bike.2.best.test, d.bike.test.svm$y)
##                             
## predict.svm.bike.2.best.test    0    1    2    3    4
##                            0 2631    0    0    0    0
##                            1    0 1133    0    0    0
##                            2    0    0  401    0    0
##                            3    0    0    1  144    0
##                            4    0    0    0    0   35
corrects=sum(predict.svm.bike.2.best.test==d.bike.test.svm$y)
errors=sum(predict.svm.bike.2.best.test!=d.bike.test.svm$y)
(performance_test=corrects/(corrects+errors))
## [1] 0.9997699

The test data set shows almost the same result: only a single observation (a class-2 observation predicted as class 3) is misclassified, yielding an accuracy of 99.98%. An SVM can therefore make good predictions for the two classification problems presented and the selected data set.
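Beyond the overall hit rate, the confusion table above also allows a per-class view. A minimal base-R sketch that reuses the objects from the chunk above and computes the recall (correct predictions divided by actual class size) for each class:

```r
# Per-class recall from the test-set confusion table
# (sketch; reuses the prediction objects defined above)
conf <- table(pred  = predict.svm.bike.2.best.test,
              truth = d.bike.test.svm$y)
recall_per_class <- diag(conf) / colSums(conf)  # correct / actual, per class
recall_per_class
```

With the table shown above, every class except class 2 would reach a recall of 1, and class 2 would fall just below 1 because of the single misclassified observation.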

6. Neural Networks

This last chapter creates, trains and tests the neural network prediction model. Note that the helper function displayDigit below stems from an image-classification (MNIST-style) workflow and expects a 28 x 28 pixel image; since the rows of the bike data contain far fewer values, it triggers the warnings shown below.

displayDigit <- function(X){
  # Reshape a flat pixel vector into a 28 x 28 matrix
  # (MNIST-style input; does not match the bike data, hence the warnings)
  m <- matrix(unlist(X), nrow = 28, byrow = T)
  m <- t(apply(m, 2, rev))  # rotate so the image is displayed upright
  image(m, col = grey.colors(255))
}

set.seed(400)
(ref=sample(1:dim(d.bike.new)[1], 1, replace=F))
## [1] 9829
displayDigit(d.bike.new[ref,-1])
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]

set.seed(13)
training <- d.bike.train.new[,-1]
test <- d.bike.test.new
test_total <- d.bike.new

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 days 10 hours 
##     H2O cluster timezone:       Europe/Zurich 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.1 
##     H2O cluster version age:    2 months and 7 days  
##     H2O cluster name:           H2O_started_from_R_taacece1_pal554 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.27 GB 
##     H2O cluster total cores:    16 
##     H2O cluster allowed cores:  16 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.2 (2019-12-12)
train_hf <- as.h2o(training)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
set.seed(105)

model_dl <- h2o.deeplearning(y = 4, training_frame = train_hf, nfolds = 10, seed=500, verbose = F)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
# prediction on TRAIN
predicted_h20_deep_neural_network_probabilites_train <- h2o.predict(model_dl, train_hf)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#table( as.data.frame(predicted_h20_deep_neural_network_probabilites_train$predict))
predicted_h20_deep_neural_network_class_train <- as.data.frame(predicted_h20_deep_neural_network_probabilites_train$predict)
#table(training$cnt, predicted_h20_deep_neural_network_class_train[,1])
predicted_h20_deep_neural_network_performance_train <- mean(predicted_h20_deep_neural_network_class_train[,1] == training$cnt)*100
print(paste('training accuracy: ',predicted_h20_deep_neural_network_performance_train,"%"))
## [1] "training accuracy:  1.01273592143624 %"
#prediction on TEST
test_hf <- as.h2o(test)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
predicted_h20_deep_neural_network_probabilites_test <- h2o.predict(model_dl, test_hf)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#table( as.data.frame(predicted_h20_deep_neural_network_probabilites_test$predict))
predicted_h20_deep_neural_network_class_test <- as.data.frame(predicted_h20_deep_neural_network_probabilites_test$predict)
#table(test$cnt, as.integer(predicted_h20_deep_neural_network_class_test[,1]))
predicted_h20_deep_neural_network_performance_test <- mean(predicted_h20_deep_neural_network_class_test[,1] == test$cnt)*100
print(paste('test accuracy: ',predicted_h20_deep_neural_network_performance_test,"%"))
## [1] "test accuracy:  1.33486766398159 %"
set.seed(400)
(ref_test=sample(1:dim(test)[1], 4, replace=F))
## [1] 1637 3958 3580 1756
for (idx in ref_test){
  real_class =test[idx,1]
  prediction = predicted_h20_deep_neural_network_class_test[idx,1]
  correct = (as.integer(real_class)==as.integer(prediction))
  print(paste('Index ',idx,' class: ',(as.integer(real_class)-1)," --> predicted: ",prediction," [correct = ",correct,"]"))
  displayDigit(test_total[idx,-1])
  i <- readline(prompt="Press [enter] to continue")
}
## [1] "Index  1637  class:  3  --> predicted:  1  [correct =  FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]
## Press [enter] to continue
## [1] "Index  3958  class:  3  --> predicted:  1  [correct =  FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]

## Press [enter] to continue
## [1] "Index  3580  class:  2  --> predicted:  2  [correct =  FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]

## Press [enter] to continue
## [1] "Index  1756  class:  3  --> predicted:  1  [correct =  FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]

## Press [enter] to continue

7. Conclusion

The purpose of this chapter is to compare all the models that have been used in this work to analyze and predict bike rentals. The RMSE is used as the comparison metric, i.e. the measure of fit. To guarantee a fair comparison, a 10-fold cross-validation is applied.
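The 10-fold cross-validation mentioned above can be made explicit. A minimal base-R sketch for the linear model; the formula `cnt ~ .` and the data frame `d.bike.new` are assumptions here and would have to match the model actually fitted (e.g. `lm.bike.1`):

```r
# 10-fold cross-validated RMSE for a linear model (sketch)
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(d.bike.new)))  # random fold labels
cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(cnt ~ ., data = d.bike.new[folds != i, ])    # fit on k-1 folds
  pred <- predict(fit, newdata = d.bike.new[folds == i, ])# predict held-out fold
  sqrt(mean((d.bike.new$cnt[folds == i] - pred)^2))       # fold RMSE
})
mean(cv_rmse)  # cross-validated RMSE estimate
```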

lm.bike.1.rmse <- sqrt(mean(lm.bike.1$residuals^2))
gam.bike.1.rmse <- sqrt(mean(gam.bike.1$residuals^2))
poi.bike.1.rmse <- sqrt(mean(poi.bike.1$residuals^2))

#regression.tree.rmse <- sqrt(mean((d.bike.test.new$cnt - tree.regression.bike.2.pred)^2))
#classification.tree.rmse <- sqrt(mean((d.bike.test.new$class - tree.classification.bike.pred.test)^2))

bagging.rmse <- sqrt(mean((yhat.bag - d.bike.test.new$cnt)^2))
randomforest.rmse <- NA
boosting.rmse <- sqrt(mean((yhat.boost - d.bike.test.new$cnt)^2))

#svm.rmse <- rmse(svm.bike.2.best, d.bike.test.svm$y)
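Note that `mean(yhat - y)` would compute the mean error, in which positive and negative deviations cancel out; the RMSE requires squaring the differences first. A small helper function keeps the computation identical for every model (a sketch, not part of the original code):

```r
# Root mean squared error between observed and predicted values
rmse <- function(y, yhat) {
  sqrt(mean((y - yhat)^2))
}

# Example usage with the objects from above, e.g.:
# bagging.rmse  <- rmse(d.bike.test.new$cnt, yhat.bag)
# boosting.rmse <- rmse(d.bike.test.new$cnt, yhat.boost)
```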